Editor's
Introduction 



This issue presents papers on diverse 
computing topics — the Internet, 
modern Fortran language extensions 
for parallel computing, and perfor- 
mance measurement of AlphaServer 
64-bit RISC systems — each repre- 
senting an area of engineering 
strength for Digital. Also in the issue
is a thought-provoking paper on the 
preservation of historical computers. 

The opening paper on the Internet 
Protocol version 6 examines the status 
of today's Internet and looks toward 
its future. Digital is one of several 
companies participating in the work- 
ing groups of the Internet Engineer- 
ing Task Force on the transition to 
a new protocol. Dan Harrington, 
Jim Bound, Jack McCann, and Matt 
Thomas report what they have learned 
from designing an IPv6 prototype, 
and compare and contrast the new 
version with the existing protocol, 
IPv4. The most important difference 
between the versions — one that will 
relieve the strain on the Internet — is 
the increase in IPv6 of address size 
from 32 bits to 128 bits. The authors 
conclude with a look at future work 
in such areas as security and data link 
interfaces for ATM. 

Our next paper — an unusual one 
not only for the issue but for this 
Journal — temporarily moves the dis- 
cussion from computing's future to 
its past. Max Burnet and Bob Supnik 
argue that an understanding of com- 
puting's past is vital to its future. The 
authors present two computer preser- 
vation techniques: restoration and 
simulation. To exemplify issues in 
restoration, they review the status of 
a project to restore a large UNIBUS-based
PDP-11 system. The section



on simulation describes the types and 
purposes of simulators and presents a 
case study of SIM, a simulator imple- 
mented in C for the study of historical 
computer architectures. 

In a paper on modern Fortran, Bill 
Celmaster demonstrates that today's
Fortran is a viable mainstream lan- 
guage for parallel computing. Since 
its development more than 40 years 
ago, Fortran has been extended by
language designers to meet the needs 
of users, particularly the needs of 
scientific/technical users who require 
mathematical expressivity and code 
optimization. Bill reviews key features 
of Fortran 90, recent efforts to stan- 
dardize parallel extensions to Fortran, 
and shared-memory parallelism. He 
includes three case studies that illus- 
trate the data parallel and single- 
program-multiple-data styles of 
programming. 

Two papers describe testing
methodologies that resulted in lead- 
ership system performance under 
the TPC-C benchmark for a cluster 
system and for a single-node system. 
The first paper presents the evalua- 
tion of an AlphaServer 8400 5/350 
TruCluster configuration support- 
ing the Oracle Parallel Server data- 
base. Judy Piantedosi, Archana 
Sathaye, and John Shakshober dis- 
cuss the system tuning and the record- 
setting results of their work. The sec- 
ond paper, by Tareef Kawaf, John 
Shakshober, and Dave Stanley, looks 
at two optimization techniques,
locking intrinsics and OM profile-based
optimization, applied to a
large database program running in 
the very large memory (VLM) envi- 
ronment on an AlphaServer 8400 



system. The results of these optimi- 
zations are significant increases in 
throughput and database-cache hit 
ratios. 

The development of AltaVista 
Forum is the subject of our final 
paper. Unlike other groupware prod- 
ucts, AltaVista Forum uses the World 
Wide Web as an infrastructure to 
facilitate the rapid development of 
collaboration applications for NT and 
UNIX systems. Dah Ming Chiu and 
Dave Griffin explain this design deci- 
sion and share their experiences with 
usability studies, an interpretive language
(Tcl) for building the toolkit,
and the inclusion of an indexing and 
search engine. 

The next issue of the Journal will 
feature the new AlphaServer 4100 
high-performance midrange server 
system, a new implementation of 
MEMORY CHANNEL, and large- 
database technologies in the VLM 
environment. 




Jane C. Blake 
Managing Editor 
Foreword




Alan G. Nemeth

Corporate Consultant

UNIX Architecture and Technology

"The Internet is dying." I feel quite
confident you will regularly see articles 
with this message in the industry and 
general press over the next few years. 
The message won't be as new as the 
authors of the articles might believe, 
and the work to remove the most 
frequently identified problems was
begun years ago within the Internet 
Engineering Task Force (IETF). 
Internet Protocol version 6 (IPv6) 
is a large family of protocols that 
form the basis of the IETF response 
to a set of problems identified in the 
early 1990s and for which the need
is accelerated by the explosion of 
Internet usage. 

One of the major concerns about 
the current Internet is the limited 
amount of address space. The underlying
address for IP endpoints is 32
bits wide, permitting a total of 4 bil- 
lion distinct addresses. Although this 
number seems large (and it seemed 
truly gigantic in the early 1970s when 
the width was selected), it is currently 
a real, practical barrier to current 
deployment patterns. Large users 
of Internet addresses can no longer 
get the address space they need for 
assignments. Because the Internet 



has run as a decentralized organization 
over the years, there is no effective 
central administration to support com- 
petition for scarce resources such as 
address space. Instead, the response of 
the community is to provide resources 
sufficient to keep allocation as a low- 
overhead activity. So IPv6 defines an 
address space of 128 bits. This currently
seems like a gigantic number!

But limited address space is hard to 
build into a persuasive case for change. 
End users are much more likely to be 
concerned about the local problem 
of getting just "one more address," 
rather than the problems of keeping
the Internet as a whole alive and
functioning. So the IPv6 design deliberately
incorporates a set of functionality
improvements that provide
attractive end-user capabilities. IPv6 
includes much easier schemes for 
assigning addresses, which will reduce 
the administrative burden for users 
and their network managers. IPv6 
provides a better framework for
encryption and an expectation that 
it will be widely available and used. 
And IPv6 provides some systematic 
mechanisms for describing requests 
for specific quality levels in the service 
offered by the transport provider. 
These capabilities will address some 
very real, practical problems that 
do afflict individual end users of the 
Internet.

However, there is no expectation 
that it is acceptable to switch the set 
of Internet users to IPv6 either simul- 
taneously or even over an extended 
time period. IPv6 must interoperate
with the current installed IPv4 pro- 
tocols for an indefinite period. This 
implies services that translate between 
the different addresses (and address 



assignment approaches that ease 
mechanical derivation of IPv6 
addresses from IPv4 addresses), 
dual protocol stacks to permit com- 
munication with both protocols 
depending on the capabilities of the 
participants in the conversation, and 
schemes to accommodate security 
mechanisms and quality of service
requests. 

The entirety of IPv6 represents 
a large implementation effort to 
be undertaken by many different 
organizations. The Internet repre- 
sents the largest example I know of a 
distributed computation that has survived
for 27 years. (I date from 1969
when the first ARPANET [Advanced 
Research Projects Agency Network] 
nodes were installed.) With a few 
notable exceptions, this computation 
has run continually, despite constant 
changes in hardware, software, implementers,
and operators. It has survived
explosive growth far beyond
the designs of its originators. It has 
done so with a volunteer organization
driving the development direction.
The community spirit has been
crucial to making this work. IPv6
is an example of that community 
at work; no one organization can 
implement it all, either at a product 
level or at a deployment level. 

The IPv6 paper in this issue
describes the technical design needed 
to build an IPv6 implementation 
for the core protocols under the 
Digital UNIX operating system. 
Digital has been one of the leading 
prototype builders of the design spec- 
ifications as they have evolved in the 
industry debates. At the time the 
Internet Protocol Next Generation 
(IPng) Directorate officially adopted 
key elements of the protocol,
Digital's implementation was 
the only one running to demonstrate
that the design was indeed feasible. 
But we don't believe that we can 
implement all the pieces of IPv6 as a
single company. Therefore we choose
to share the implementation experi- 
ence through this paper to aid others 
in their efforts to deal with the imple- 
mentation problems. We also don't 
claim completeness; the full suite of 
specifications for IPv6 is evolving, and 
the software to implement it is large. 
We fully expect that portions of our 
ultimate product offerings will be 
developed by others in the industry. 

The long-term evolution of the 
Internet captured in the IPv6 imple- 
mentation paper is but one example 
in this issue of the extent to which 
computing now has a history that 
gives us much insight into the future. 
Certainly the paper by Supnik and 
Burnet is an explicit trip through 
computing history. The re-creation,
both physical and logical, of computing
systems of the past can only help
remind us that the artifacts we create 
have a longer life than we anticipate. 
As our programmers write new code, 
or our hardware designers produce
new architectural approaches, or our 
storage designers push the bound- 
aries on new media technologies, do 
they consider the imponderables of 
running these systems 25 or more
years in the future? The view of archivists
trying to preserve this history
reminds us of the difficulty of preservation
after the fact and of the amazing
duration of design decisions.

The paper on the evolution of 
Fortran is yet another example of the 
rich history of computing. Here we



see clearly the evolution of a key 
language to accommodate the changing
patterns of system architectural
designs and parallel program con- 
cepts. The computer industry fre- 
quently develops commercially 
important programs by evolution — 
the 100,000-line program that 10 
years later has become 10 million 
lines of code in an assortment of 
languages and computing styles.
Here the venerable Fortran (first 
introduced in 1954!) adds support 
for some of the latest approaches to 
fast system interconnect represented 
by MEMORY CHANNEL and the 
parallel architectures of clusters of 
SMP systems. 

MEMORY CHANNEL reappears in 
the paper about TPC-C performance 
on TruCluster systems. This paper, 
one of a pair on the issues of tuning 
a commercially important benchmark, 
presents an attractive model for the 
benefits in performance that can be 
derived from a very fast interconnect
and software structures to match. 
The performance levels achieved 
shatter world records on a bench- 
mark that has had extensive atten- 
tion and work. 

The other paper on TPC-C per- 
formance with very large memory 
(VLM) illustrates the truth of an old
design maxim, "If memory is get- 
ting cheaper, use more of it!" When 
Digital first built a 2-gigabyte (GB)
memory board, it took more than 
a million dollars' worth of DRAM 
chips to populate the initial instance. 
However, memory prices have continued
to drop sharply, and today
over 40 percent of the AlphaServer 
8400 systems ship with 2 GB or more 
of memory. The memory prices will 



continue to come down, and the 
insights offered in this paper will help 
in understanding where additional 
memory can provide real benefits to 
customer workloads. 

The final paper in the collection is 
on the AltaVista Forum approach to 
collaboration among groups exploit- 
ing the Internet and WWW technolo- 
gies and brings us back around to the 
initial thoughts in this foreword. The 
ubiquitous nature of the Internet per- 
mits and encourages tools such as this 
that utilize computer systems in new 
ways. This approach builds on the 
fabric that we emphasized in the IPv6 
paper but sees the Internet as a tool 
and a component of a larger solution 
and shows how to exploit these capa- 
bilities to allow new ways of working. 
Using imagination and building on 
the work of others are characteristic 
of the approach taken by those who 
are catalysts in the industry. The 
paper demonstrates how easy it is to 
build a system that would have been
a major project just five years ago. 
This ease of construction is a benefit 
of the programming techniques and 
infrastructure investments and a spur 
to keep doing more of it. 
Internet Protocol
Version 6 and the Digital 
UNIX Implementation 
Experience 






Daniel T. Harrington 
James P. Bound 
John J. McCann 
Matt Thomas 



In the early 1990s, the Internet community rec- 
ognized that the current TCP/IP architecture was 
not capable of sustaining the explosive growth 
of the Internet. In July 1994, the Internet Protocol 
next generation (IPng) directorate responded to 
the problem with the Internet Protocol version 6 
(IPv6) as the replacement network layer proto- 
col. Working groups of the Internet Engineering 
Task Force (IETF) then began to build specifications 
that would address the needs for an expanded 
Internet address space, an increase in router table 
size, and new technology features. As a contributor
to these efforts, Digital has implemented
IPv6 on the Digital UNIX platform. The primary 
goal of Digital's efforts has been to evaluate the 
technical feasibility of the proposed architecture 
and provide critical feedback to the standards 
development process in the IETF. The secondary 
goal has been to evaluate system design alter- 
natives to gain the experience needed to allow 
Digital to incorporate this new architecture into 
existing products. 



As one of its ongoing advanced development efforts in 
networking technology, Digital has built an Internet 
Protocol version 6 (IPv6) prototype for the Digital 
UNIX operating system. In this paper, we describe the 
design of the Digital UNIX IPv6 prototype and its his- 
tory relevant to the Internet Protocol next generation 
(IPng) effort in the Internet Engineering Task Force 
(IETF). We also compare its relationship with the 
existing Transmission Control Protocol/Internet 
Protocol (TCP/IP) suite. We emphasize techniques 
and technologies that were developed to accommo- 
date particular aspects of the IPv6 architecture and 
issues that required further discussion in the IETF. In 
particular, we discuss the modifications to the trans- 
port layer modules to use two distinct network layer 
protocols, along with the implications to the UNIX 
socket layer and applications. In addition, we describe
the new IPv6 and Internet Control Message Protocol 
(ICMP) network layer modules, including their inter- 
actions with both the data link layer and the IPv4 
protocol. We review the new Neighbor Discovery
Protocol and its algorithms and give details of its 
implementation. 

To accommodate the dynamic nature of future networks,
IPv6 includes mechanisms to do both stateless
and stateful address configuration, as well as router
discovery; we explain the design of a user-mode
process that implements these functions. The paper 
includes a discussion of enhancements to well-known 
IPv4 services, such as dynamic updates to the domain
naming service (DNS), as well as general techniques 
to support the transition of existing applications. The 
paper concludes with an overview of what we have 
learned in this project and summarizes our current status
and future work, including efforts in nonbroadcast
multiple access (NBMA) data link technologies such as
asynchronous transfer mode (ATM) and resource reservation
protocols.



Internet Protocol Next Generation 



In the early 1990s, the members of the Internet com- 
munity realized that the address space and certain 
aspects of the current TCP/IP architecture were not 
capable of sustaining the explosive growth of the 
Internet. Within the IETF, several efforts were undertaken
to both study and improve the use of the 32-bit
Internet Protocol (IPv4) addresses, as well as to identify
and replace protocols and services that would limit
growth. The 32-bit addressing architecture in the net- 
work layer was quickly determined to be the crux of 
the problem, with both hardware and human limits 
approaching fundamental boundaries. 1 IPv4 addresses 
are unevenly allocated in blocks that are often too 
large or too small; they are also difficult to change 
within any existing network. 

When the IETF called for replacement proposals, 
Digital participated in this industry-wide effort by 
submitting white papers outlining issues and by developing
and evaluating prototypes of the various proposals.
Digital also participated in the IETF working
groups and in the IPng directorate, which had the 
responsibility for making the ultimate decision. In July
1994, the IPng directorate selected the Internet
Protocol version 6 (IPv6) as the replacement network
layer protocol, and IETF working groups began to
build specifications. "The Recommendation for the 
IP Next Generation Protocol" summarizes the candi- 
dates and explains the selection of this protocol. 2 

Digital UNIX Prototype 

The current Digital UNIX IPv6 prototype project is 
Digital's most recent addition to an ongoing effort to 
develop and evaluate the competing IPng proposals. 
This began with the Simple Internet Protocol (SIP), 
which used eight octet addresses. SIP was later melded 
with another early proposal and became known as 
Simple Internet Protocol Plus (SIPP), the direct 
antecedent of IPv6. 3 The primary goal of Digital's
efforts has been to evaluate the technical feasibility of 
the proposed architecture and provide feedback to the
IETF working groups. This is critical to the standards 
development process in the IETF, which requires mul- 
tiple independent and interoperable implementations 
of a specification before it may become an Internet 
standard. An additional goal has been to evaluate system
design alternatives to gain the experience needed
to allow Digital to incorporate this new architecture 
into existing products. Digital has made the prototype 
available to researchers within the company as a source 



code distribution and more recently has begun to sup- 
ply binary kits for early adopters and evaluators in the 
Internet community. As the IPv6 protocol and architecture
mature, we have begun to focus on how to
best integrate the code into the Digital UNIX product. 

IPv6 Overview 

To understand the system-wide impact of IPv6, we
review some of its new features and contrast them with
the IPv4 model. IPv6 is both a completely new
network layer protocol and a major revision of the
Internet architecture. At both levels, it builds upon 
and incorporates experiences gained with IPv4. 

Figure 1 shows the evolution of the packet format 
into the new IPv6 header. It retains some fields (ver- 
sion, source, and destination address), clarifies the role 
of others (for example, the Time To Live [TTL] field
is renamed the Hop Limit), and introduces new ones 
(such as Flow ID) with as yet untapped potential. The 
next header field allows for modular construction of 
complex packets: different header types can be chained
together to provide specialized functionality, includ- 
ing security and source routing. Finally, all headers are 
structured to allow 64-bit alignment, which should 
allow optimal processing both at source and destina- 
tion systems, as well as in transit. 4 

The most striking departure from IPv4 is the 
address size: it has increased from 32 bits to 128 bits. 
The IPv6 addressing architecture is rich, with prefixes 
for multicast addresses and predefined scopes for both 
unicast and multicast addresses. One special type of 
unicast address is the link-local address, which permits 
communications with only those systems directly con- 
nected on the same link. This allows a standard boot- 
strapping mechanism, so that systems can learn about 
neighbors and services before a routable address is
assigned to an interface. Various address assignment 
options have been defined, including hierarchical 
models based upon regional registries and service 
provider identifiers. In each case, care has been taken 
to ensure proper route aggregation, which will help 
yield more efficient backbone router performance.

Multiple means of acquiring addresses have been 
defined for IPv6 addressing, with the goals of allowing 
flexibility through different administrative policies 
and, perhaps more important, of demanding that network
address reassignment be supported throughout
the architecture. The two new addressing services are
Stateless Address Autoconfiguration and the stateful,
transaction-based Dynamic Host Configuration Protocol
version 6 (DHCPv6). 7,8 In the stateless model,
address prefixes are learned by listening for router
advertisement packets. Addresses are formed by combining
the prefix with a link-specific token such as the
48-bit Ethernet hardware address. In the stateful procedure,
hosts may request addresses, configuration
information, and services from dedicated configuration
servers, with routers potentially serving as relay
stations during the initial phase. In both cases, the
resulting addresses have associated lifetimes, and sys- 
tems must be prepared to both learn new addresses 
and release expired addresses. Combined with the 
ability to register updated address information with 
DNS servers, these mechanisms provide a path toward 
network renumbering, a goal that has proved difficult 
to achieve in the IPv4 world.
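
As a rough illustration of the stateless model, the following Python sketch combines an advertised prefix with a 48-bit Ethernet address used as the link-specific token. The function name `form_address`, the 80-bit prefix length, and the sample values are assumptions made for this example; they are not taken from the Digital UNIX prototype.

```python
import ipaddress

TOKEN_BITS = 48  # size of an Ethernet hardware address


def form_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine an advertised prefix with a 48-bit link-specific token."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen > 128 - TOKEN_BITS:
        raise ValueError("prefix leaves no room for a 48-bit token")
    token = int(mac.replace(":", ""), 16)  # hardware address as an integer
    # The token occupies the low-order 48 bits; the prefix supplies the rest.
    return ipaddress.IPv6Address(int(net.network_address) | token)


# Hypothetical link-local prefix and hardware address.
print(form_address("fe80::/80", "08:00:2b:12:34:56"))
```

Because the address is derived from the prefix, a host that hears a new prefix can form a new address without contacting a server, which is what makes lifetime-based renumbering practical.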

Finally, the Internet Control Message Protocol version 6
(ICMPv6) was developed. 9 This specification
aimed to merge the functions of two distinct IPv4 protocols
for reporting errors and status: ICMP for unicast
packet transmission and the Internet Group
Management Protocol (IGMP) for multicast traffic.

The messages defined in this protocol are catego- 
rized as either error or informational, with a family of 
messages in the second group used to provide the 
Neighbor Discovery Protocol. 10 Neighbor discovery
serves multiple purposes with the overall theme of
providing a system with topological and environmental
hints. For example, link-layer address resolution,
router discovery, destination address redirection, and
address autoconfiguration mechanisms are all specified
using neighbor discovery packet types.

Although the network layer did experience the largest
amount of change, Figure 2 shows that the effects of
this work touch nearly all aspects of the Digital UNIX
system. We point out examples of decisions made due to 
our fundamental design philosophy, which is based 
upon integration with the UNIX system framework, 
modular and extensible software, support for multiple 
operational policies, and a desire to take advantage of 
the Alpha platform without compromising portability. 

In the following sections, we study these topics in
depth, beginning with the network layer, then covering
the transport layer modifications and the new
neighbor discovery algorithms. After that, we discuss
address autoconfiguration mechanisms and their
effects upon the system. We conclude with services
that will be affected by the transition from IPv4 to 
IPv6 such as the socket application programming 
interface (API) and DNS. 
Base Platform Changes



Network Layer 

In this section, we review the processing requirements 
of the IPv6 modules, including ICMPv6, extension 
header options, and fragmentation. An early design
decision was made to base the networking subsystem
on the Berkeley Software Distribution (BSD) 4.4
model and code base, which allows great flexibility in
dealing with multiple network layers. 11 This also has
the advantage of providing support for variable-bit-length
netmasks (also known as CIDR-style netmasks,
from Classless Inter-Domain Routing), which are
appropriate to both IPv4 and IPv6. 12 We have also
tried to take maximum advantage of the 64-bit Alpha
architecture when implementing IPv6, while making 
certain that this implementation would run on 32-bit 
CPUs as well. For example, the checksum routines 
operate on 32-bit quantities (allowing the carry to 
overflow into the upper 32 bits of a 64-bit register). 
The checksum routine is also designed to allow it to be 
issued to multiple Alpha execution units, which 
remains a topic for further investigation. 
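
The wide-accumulator idea can be sketched in a few lines of Python. This is only an illustration of the technique, not the prototype's routine: the name `checksum32` is hypothetical, and the 64-bit register is emulated with a mask. Data is added 32 bits at a time, carries collect in the upper half of the accumulator, and the total is folded down to 16 bits at the end.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def checksum32(data: bytes) -> int:
    """Internet checksum computed by adding 32-bit words into a 64-bit sum."""
    if len(data) % 4:
        data += b"\x00" * (4 - len(data) % 4)  # zero padding does not change the sum
    acc = 0
    for (word,) in struct.iter_unpack("!I", data):
        acc = (acc + word) & MASK64            # carries land above bit 31
    # Fold 64 bits to 32, then 32 bits to 16, adding the carries back in.
    acc = (acc & 0xFFFFFFFF) + (acc >> 32)
    acc = (acc & 0xFFFFFFFF) + (acc >> 32)
    acc = (acc & 0xFFFF) + (acc >> 16)
    acc = (acc & 0xFFFF) + (acc >> 16)
    return ~acc & 0xFFFF


print(hex(checksum32(bytes.fromhex("0001f203f4f5f6f7"))))
```

Deferring the fold works because one's complement addition is insensitive to word grouping: each 32-bit word contributes the same amount as its two 16-bit halves once the carries are added back.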

Adaptations to Existing IP and ICMP Routines 

The IPv6 and ICMPv6 routines are completely
independent of the corresponding IPv4 and ICMPv4 
routines, and the processing styles have distinct differ- 
ences. In IPv6, the incoming packet is treated as being 
read-only, while the BSD IPv4 code manipulates fields 
within the IPv4 header. We also avoid unnecessary use 
of the m_pullup routine (which consolidates chained 
memory buffers into a single large buffer) because this 
could cause the packet to be needlessly lost. Finally,
instead of passing numerous arguments when calling 
from function to function, a common data structure is 
used to store necessary data and pointers; for most
function calls, it is only necessary to pass a pointer 
to this structure. This reduces the stack overhead and 
also yields modular and easily extensible subroutines. 

IPv6 has a dedicated interrupt processing thread, 
and received IPv6 packets are placed onto their own 
interface input queue (ifqueue). When an IPv6 packet 
is taken off the ifqueue, basic validity tests are done; 
only after passing them is the packet tested to see if it
is directed to a unicast or a multicast address. 

If the packet is to a multicast address, the destination
is compared to the enabled IPv6 multicasts for the
interface over which the packet was received. If the
destination matches, the packet is passed up to normal
packet processing; otherwise, a copy of the packet is
passed to the multicast forwarder.

Similarly, unicast packets are checked to see that the 
destination matches one of the system's addresses. In 
the special case of the packet being targeted to a link- 
local address, only the link-local address for the receiv- 
ing interface is compared. If there is an exact match, 
the packet is processed normally; otherwise, it is 
passed to the unicast packet forwarding routine. 

Header Processing 

After a packet has been matched to a local address, the 
IPv6 headers must be processed, independently of 
whether the packet is multicast or unicast. This pro- 
cessing is done in a common routine that handles all 
types of IPv6 headers. A number of actions may result 
from the verification and analysis phase, including an 
ICMPv6 packet being sent back to the source, the
packet being silently dropped, or being forwarded to
another node due to a source route. If none of these 
possibilities occurs, the next IPv6 header in the packet 
is processed. 

If the header is a known IPv6 header type, the
appropriate routine is called. If not, this packet is 
probably destined for another protocol module such 
as TCP, the User Datagram Protocol (UDP), or 
ICMPv6. The header type is looked up in the list of 
active protocols and passed to the matching protocol 
input routine. If no entry is found, an ICMPv6 error 
may be sent back. 
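
The dispatch logic described above can be sketched as a loop over the next-header chain. The sketch below is a simplified model under stated assumptions: `packet` holds the bytes that follow the fixed IPv6 header, the handler tables and function names are hypothetical, and an unknown header type simply returns a marker where a real stack might send an ICMPv6 error. The numeric constants are the standard protocol numbers.

```python
# Standard next-header values.
HOP_BY_HOP, ROUTING, DEST_OPTS = 0, 43, 60
TCP, UDP, ICMPV6 = 6, 17, 58


def skip_extension_header(packet: bytes, offset: int):
    """Step over a generic extension header; return (next header, new offset)."""
    next_header = packet[offset]
    # The length byte counts 8-octet units beyond the first 8 octets.
    length = (packet[offset + 1] + 1) * 8
    return next_header, offset + length


def dispatch(first_header: int, packet: bytes, ext_handlers: dict, protocols: dict):
    """Process known IPv6 headers, then hand the rest to a protocol module."""
    header, offset = first_header, 0
    while header in ext_handlers:            # known IPv6 header type
        header, offset = ext_handlers[header](packet, offset)
    handler = protocols.get(header)          # look up the active protocol
    if handler is None:
        return ("icmpv6_error", header)      # no entry found
    return handler(packet, offset)


ext = {HOP_BY_HOP: skip_extension_header, DEST_OPTS: skip_extension_header}
protos = {UDP: lambda pkt, off: ("udp", pkt[off:])}
pkt = bytes([UDP, 0, 1, 4, 0, 0, 0, 0]) + b"data"
print(dispatch(DEST_OPTS, pkt, ext, protos))
```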

Header Options 

Since the hop-by-hop and destination node headers 
have the same format, a common routine processes 
both types. As the routine processes each option, 
it validates the option. If this fails, it checks whether 
an ICMPv6 parameter problem error should be 
sent, whether the packet should be discarded, or the 
option ignored. 
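
In the IPv6 specification, the action for an unrecognized option is encoded in the two high-order bits of the option type, which is what lets one routine serve both header kinds. The following Python sketch walks an option area and applies that rule. The function name, the string return values, and the treatment of a truncated option as a silent discard are assumptions for illustration.

```python
PAD1, PADN = 0, 1


def process_options(opts: bytes, known: set, dst_is_multicast: bool = False) -> str:
    """Walk a hop-by-hop or destination options area and decide the outcome."""
    i = 0
    while i < len(opts):
        opt_type = opts[i]
        if opt_type == PAD1:                 # single-byte padding, no length field
            i += 1
            continue
        if i + 2 > len(opts) or i + 2 + opts[i + 1] > len(opts):
            return "discard"                 # truncated option (assumed policy)
        if opt_type != PADN and opt_type not in known:
            action = opt_type >> 6           # two high-order bits of the type
            if action == 1:
                return "discard"
            if action == 2:
                return "discard+icmp"
            if action == 3:
                return "discard" if dst_is_multicast else "discard+icmp"
            # action == 0: ignore the option and keep going
        i += 2 + opts[i + 1]
    return "accept"


print(process_options(bytes([0x85, 0]), known=set()))
```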

ICMPv6 Processing and Checksums 

Upon receipt of an ICMPv6 packet from a node in the 
network reporting an error or other information, it is 
first validated for correct packet format and checksum.
The packet is then further processed based upon its 
ICMPv6 type value. If it has an ICMPv6 error type (i.e., 
type value less than 128), the appropriate notifications
are sent to the affected protocol. Neighbor discovery 
packets, which are all informational, have a number of 
additional consistency checks, and the packet is 
dropped if it fails them. After the ICMPv6 packet has 
been processed, it is also sent to any ICMPv6 raw sockets
that have requested reception of that type. The
exception to this rule is an ICMPv6 echo request 
packet, which is not copied to the raw sockets. 

When an ICMPv6 echo request is received and 
validated, the ICMPv6 echo response packet is prepared.
In the typical case, it is identical to the echo
request except for the ICMPv6 type and checksum
value. The exception would be an echo request sent to
a multicast address, in which case a source address
must also be selected. Rather than computing the
checksum on the new packet, the received checksum is
simply adjusted down by 1, since the sole difference
between the two packets is the value of the ICMPv6
type fields, and ICMPv6 echo request and echo
response types differ by 1.
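
The checksum adjustment can be checked with a short Python sketch. It uses the general incremental-update formula from RFC 1624 rather than the prototype's own code, and the helper names are hypothetical. Because the type field is the high-order byte of the first 16-bit word, raising the type by 1 (128 to 129) lowers the checksum by 1 in that byte position, that is, by 0x0100. The pseudo-header is left out here on the assumption of a unicast reply, where swapping source and destination leaves the sum unchanged.

```python
def ones_add(a: int, b: int) -> int:
    """16-bit one's complement addition with end-around carry."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)


def internet_checksum(data: bytes) -> int:
    """Full checksum over the data, for comparison with the shortcut."""
    if len(data) % 2:
        data += b"\x00"
    s = 0
    for i in range(0, len(data), 2):
        s = ones_add(s, (data[i] << 8) | data[i + 1])
    return ~s & 0xFFFF


def adjust_checksum(old_cksum: int, old_word: int, new_word: int) -> int:
    """Incremental update: HC' = ~(~HC + ~m + m') (RFC 1624, equation 3)."""
    s = ones_add(~old_cksum & 0xFFFF, ~old_word & 0xFFFF)
    s = ones_add(s, new_word)
    return ~s & 0xFFFF


ECHO_REQUEST, ECHO_REPLY = 128, 129
body = bytes([0x12, 0x34, 0x00, 0x01]) + b"ping"       # identifier, sequence, data
request = bytes([ECHO_REQUEST, 0, 0, 0]) + body         # checksum field zeroed
old = internet_checksum(request)
new = adjust_checksum(old, ECHO_REQUEST << 8, ECHO_REPLY << 8)
print(hex(old), hex(new))
```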

IPv6 requires all nodes to support multicasting, 
specifically level 2 as defined in "Host Extensions for 
IP Multicasting." 13 Although this was written for IPv4,
the same general algorithms are used for IPv6. One 
notable exception to this is that the multicast addresses 
used for neighbor solicitations and the predefined link-local
multicasts such as all-nodes and all-routers do
not require periodic status reports. 

Path Maximum Transmission Unit Discovery 

One of the significant differences between IPv4 and 
IPv6 concerns fragmentation. In IPv6, fragmentation
may be done only by the node from which a packet
originates. Forwarders, which may be routers or hosts
acting upon source routing headers, are not permitted 
to fragment packets. The burden is on the originating 
node to send packets that are small enough to fit 
through all the links along the paths to their destina- 
tions, where each link type may have a different maxi- 
mum transmission unit (MTU). To ease this burden, 
IPv6 defines a minimum link MTU of 576 bytes. A 
node may use this as the upper limit on packet size and 
be assured that its packets are sufficiently small to 
reach their destinations. 

The minimum MTU of all the links in a path 
between two nodes is referred to as the path MTU. 14 In
many cases, the path MTU will exceed 576 bytes, and it
is desirable to send the largest possible packets. IPv6
provides a mechanism by which a node may discover
a path's MTU. 15 When a forwarder cannot forward a 
packet because the packet is too large for the next hop's 
link MTU, it sends an ICMPv6 Packet Too Big (PTB) 
message back to the source of the packet. The PTB 



message contains the MTU of the constricting link.
The source node adjusts its packet size to fit through 
this link. 

Path MTU information is kept on a per-destination
basis and is stored in the routing table entry for a given
destination. Packets sent on that route will be sized
according to the path MTU value. When a PTB message
is received, the appropriate route is updated to
contain the new path MTU value as reported in the 
PTB message, and a timer is started. When the timer 
expires, the path MTU value is increased to the 
(known) MTU of the first hop link. This allows the 
node to detect increases in the path MTU. 
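
The per-route bookkeeping described above can be sketched in C. This is a minimal illustration under stated assumptions, not the prototype's code: the structure route_pmtu, its field names, and the functions pmtu_on_ptb and pmtu_on_timer are hypothetical, and clamping a reported value to 576 bytes is an assumption that follows from the minimum link MTU cited in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPV6_MIN_LINK_MTU 576u /* minimum link MTU cited in the text */

/* Hypothetical per-destination state kept with a route entry. */
struct route_pmtu {
    uint32_t first_hop_mtu; /* known MTU of the first-hop link */
    uint32_t path_mtu;      /* current path MTU estimate */
    int      timer_running; /* nonzero while the increase timer is pending */
};

/* A Packet Too Big message arrived for this route: record the reported
 * MTU and start the timer. Clamping to the minimum is an assumption. */
void pmtu_on_ptb(struct route_pmtu *rt, uint32_t reported_mtu)
{
    assert(rt != NULL);
    if (reported_mtu < IPV6_MIN_LINK_MTU)
        reported_mtu = IPV6_MIN_LINK_MTU;
    rt->path_mtu = reported_mtu;
    rt->timer_running = 1;
}

/* The timer expired: raise the estimate back to the first-hop MTU so
 * that an increase in the path MTU can be detected. */
void pmtu_on_timer(struct route_pmtu *rt)
{
    assert(rt != NULL);
    rt->path_mtu = rt->first_hop_mtu;
    rt->timer_running = 0;
}
```

If the path MTU has not in fact increased, the next oversized packet simply draws another PTB message and the estimate drops again.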

Switches are provided to disable path MTU discovery
system-wide, on a per-destination basis, and on
a per-socket basis. When path MTU discovery is disabled,
packets are limited to 576 bytes.

Fragmentation 

A packet that is larger than the MTU of the path on
which it is to be sent must be fragmented. Unlike IPv4,
the IPv6 header contains no fields to carry fragmentation
information. Instead, this information is carried
in a specialized extension header, called the fragment
header. As shown in Figure 3, the fields in the fragment
header include an offset, in eight-octet units, and
an identifier common to all fragments of the original
packet. The M (more fragments) flag is used to indicate
intermediate fragments; the terminal fragment has the bit
cleared. Note that the amount of data in a fragment
packet is derived from the total packet length.
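
As an illustration of the offset and M fields, the following C sketch packs and unpacks the 16-bit offset/flags word of the fragment header. The layout assumed here (a 13-bit offset in eight-octet units, two reserved bits, and a one-bit M flag) is the one given in the IPv6 specification; the function names are hypothetical, and the value is kept in host byte order for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a byte offset and the M flag into the 16-bit offset/flags word.
 * The result is in host byte order; a real sender would convert it to
 * network byte order before transmission. */
uint16_t frag_pack(uint32_t byte_offset, int more_fragments)
{
    assert(byte_offset % 8u == 0);          /* offsets are multiples of 8 octets */
    assert(byte_offset / 8u < (1u << 13));  /* the offset field is 13 bits wide */
    return (uint16_t)(((byte_offset / 8u) << 3) | (more_fragments ? 1u : 0u));
}

/* Recover the byte offset from the packed word. */
uint32_t frag_byte_offset(uint16_t word)
{
    return (uint32_t)(word >> 3) * 8u;
}

/* Nonzero for intermediate fragments, zero for the terminal fragment. */
int frag_more(uint16_t word)
{
    return (int)(word & 1u);
}
```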

The first step in the fragmentation process is 
to identify the fragmentable and unfragmentable parts 
of the original packet (see Figure 4). The unfragmentable
part of the packet consists of the IPv6 header
and any extension headers that must be processed by 
each node traversed by the packet (e.g., hop-by-hop 
header, routing header). The fragment header is 
appended to the unfragmentable part. The rest of the 
packet is divided into fragments, and each fragment is 
appended to a copy of the unfragmentable part plus 
fragment header. 

When the fragment header is appended to the 
unfragmentable part, two fields in the unfragmentable 
part must be updated. First, the payload length field in 
the IPv6 header must be updated to reflect the length 
of the fragment packet. Second, the next header field 
in the last header of the unfragmentable part must be 
changed to indicate that a fragment header follows. 
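
The size arithmetic implied by these steps can be sketched as follows. This is an illustrative calculation only: the helper names are hypothetical, and it assumes the 40-octet fixed IPv6 header and 8-octet fragment header of the IPv6 specification, with every fragment except the last carrying a multiple of eight octets.

```c
#include <assert.h>
#include <stddef.h>

#define IPV6_HDR_LEN 40u /* fixed IPv6 header */
#define FRAG_HDR_LEN  8u /* fragment extension header */

/* Largest data chunk per fragment: the room left after the
 * unfragmentable part and the fragment header, rounded down to a
 * multiple of eight octets. unfrag_len includes the IPv6 header. */
size_t frag_chunk_size(size_t mtu, size_t unfrag_len)
{
    assert(unfrag_len >= IPV6_HDR_LEN);
    assert(mtu >= unfrag_len + FRAG_HDR_LEN + 8u);
    return ((mtu - unfrag_len - FRAG_HDR_LEN) / 8u) * 8u;
}

/* Number of fragment packets needed for frag_len octets of
 * fragmentable data. */
size_t frag_count(size_t mtu, size_t unfrag_len, size_t frag_len)
{
    size_t chunk = frag_chunk_size(mtu, unfrag_len);
    return (frag_len + chunk - 1u) / chunk;
}

/* Value for the payload length field of one fragment packet:
 * everything that follows the fixed IPv6 header. */
size_t frag_payload_length(size_t unfrag_len, size_t data_len)
{
    assert(unfrag_len >= IPV6_HDR_LEN);
    return (unfrag_len - IPV6_HDR_LEN) + FRAG_HDR_LEN + data_len;
}
```

For example, with a 576-byte MTU and no extension headers in the unfragmentable part, each intermediate fragment carries 528 octets of data, so 1400 octets of fragmentable data require three fragment packets.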

A copy of the unfragmentable part is created for 
each fragment packet. As an optimization, Digital 
UNIX allows portions of a packet to be shared among 
copies of the packet, to avoid an actual data copy. As 
with IPv4, care must be taken to ensure that fields 
being updated are not contained in shared buffers. 
This is typically accomplished by copying the portions
that must be updated into a private memory buffer 
(mbuf). Unlike IPv4, the unfragmentable part may 
not fit in a single mbuf, and the IPv6 fragmentation 
code must be capable of handling this case. 

To reduce the possibility of fragment loss at the 
source node, all the fragment packets are built before 
any is passed to the data link for transmission. 

A question that arises here is how big the fragment
packets should be. Should they be sized according
to the path MTU, or should they be limited to
576 bytes? The former yields the desirable larger
packets, while the latter avoids undesirable fragment
loss (due to the fragment packet being too big). The
Digital UNIX IPv6 prototype supports either choice
on a system-wide, per-destination, or per-socket basis.
This is an example of separation of mechanism from
policy, a basic guideline being used across this project.

Reassembly 

The reassembly process reconstructs the original
packet from fragment packets. Fragments belonging
to the same packet are identified by a combination of
source IP address, next header type (first header of the
fragmentable part), and fragment identifier. Individual
fragments are queued within the network layer until the
original packet can be completely reassembled, at which
point it is passed to the appropriate protocol module.

When all fragments have arrived, the original packet 
can be reassembled. A single copy of the unfragment- 
able part is kept, and the data from each fragment 
packet is appended. The payload length field of the IPv6
header is updated to reflect the length of the reassembled
packet, and the next header field of the last header
of the unfragmentable part is restored to reflect the first
header in the fragmentable part.

As with the fragmentation code, care must be taken 
so that fields being updated are not in buffers shared
with other copies of the packet. 

When the first fragment of a packet arrives, a timer
is started. If the timer expires before that packet is 
complete, the fragments are discarded. If the offset 
zero fragment has been received, an ICMPv6 error 
message is sent. 
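
A much-simplified sketch of the completion and timeout logic follows. It is not the prototype's code: the structure reasm and its functions are hypothetical, and the byte-counting approach assumes that no duplicate or overlapping fragments arrive, which a production implementation could not assume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reassembly state for one original packet. */
struct reasm {
    uint32_t bytes_seen; /* fragmentable octets received so far */
    uint32_t total_len;  /* known once the terminal fragment arrives */
    int      have_last;  /* terminal fragment (M flag clear) seen */
    int      have_first; /* offset-zero fragment seen */
};

/* Record one fragment; returns 1 when the packet is complete.
 * Assumes fragments neither repeat nor overlap. */
int reasm_add(struct reasm *r, uint32_t offset, uint32_t len, int more)
{
    assert(r != NULL);
    assert(offset % 8u == 0);
    if (offset == 0)
        r->have_first = 1;
    if (!more) {
        r->have_last = 1;
        r->total_len = offset + len;
    }
    r->bytes_seen += len;
    return r->have_last && r->bytes_seen == r->total_len;
}

/* On timer expiry the fragments are discarded; an ICMPv6 error is
 * sent only if the offset-zero fragment was received. */
int reasm_timeout_sends_error(const struct reasm *r)
{
    assert(r != NULL);
    return r->have_first;
}
```

The sketch accepts fragments in any order, since completeness depends only on the total learned from the terminal fragment.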

Forwarding and Routing 

If a received packet does not match one of the system's 
addresses and the system is not acting as a router, the 
packet is silently dropped. Otherwise, an attempt is
made to forward the packet. The first step in forwarding
is to do a lookup in the routing table; the type of
lookup depends on whether the packet contains a
nonzero flow label. If it does, the lookup is based on 
both the source address and the flow label; otherwise 
the destination address is used. If the lookup succeeds 
and the length of the packet fits within the MTU of the 
resultant route and interface, the packet is transmitted 
to the next hop as indicated by the route. Otherwise 
an appropriate ICMPv6 error is sent back to the origi- 
nating node. 
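
The forwarding decision can be summarized in a short C sketch. The enumerations and function names are hypothetical, and the routing lookup itself is reduced to a boolean input; only the decision logic stated above is shown, for a packet that does not match one of the system's addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum fwd_lookup { LOOKUP_BY_DESTINATION, LOOKUP_BY_SOURCE_AND_FLOW };
enum fwd_action { FWD_DROP_SILENTLY, FWD_TRANSMIT, FWD_SEND_ICMP_ERROR };

/* A nonzero flow label selects a lookup keyed on source address and
 * flow label; otherwise the destination address is used. */
enum fwd_lookup fwd_choose_lookup(uint32_t flow_label)
{
    return flow_label != 0 ? LOOKUP_BY_SOURCE_AND_FLOW
                           : LOOKUP_BY_DESTINATION;
}

/* Decide what to do with a received packet not addressed to this system. */
enum fwd_action fwd_decide(int is_router, int route_found,
                           size_t packet_len, size_t route_mtu)
{
    assert(!route_found || route_mtu > 0);
    if (!is_router)
        return FWD_DROP_SILENTLY;
    if (route_found && packet_len <= route_mtu)
        return FWD_TRANSMIT;
    return FWD_SEND_ICMP_ERROR; /* no route, or packet too big */
}
```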

Tunnels 

Tunneling is a mechanism that allows packets of one
network type to be encapsulated and forwarded within
a network layer packet of the same or a different type.
IPv6 packets can be tunneled over either IPv4 or IPv6
networks, as may IPv4 packets. The tunneling routine
takes as input a packet, prepends the appropriate
IP header for the network over which the packet will
be tunneled, and transmits the resultant packet over
that network. Tunnels are unidirectional; there need
not be a corresponding tunnel in the reverse direction.

Rather than having multiple tunnel interfaces (one
for each possible combination of protocol Y over
protocol X), the Digital UNIX implementation uses
a single tunnel interface. This method was the suggestion
of Keith Sklower of the University of California
at Berkeley.15 When the interface is initialized, only
automatic tunneling of IPv6 over IPv4 is enabled.
To configure a static tunnel, where fixed end points
are used, a static route is added to the routing tables
with the proper destination and gateway (tunnel end
point) addresses.

When a packet is presented to the tunnel interface,
the interface looks up the route entry of the destination
address. The route contents tell the tunneling routine how
the packet is to be encapsulated and forwarded. The route's
gateway address indicates what underlying network to
use, and the route's destination address indicates what
type of packet is being tunneled.

When a tunneled packet is received, the initial
header is stripped and the resulting packet is placed on
the appropriate IPv6 or IPv4 ifqueue. 

Transports 

One of the strengths of the IPng effort was the com- 
mitment to preserve the well-understood transports, 
TCP and UDP, upon which a wealth of applications 
have been built. 

The IPv6 specification calls for three particular
requirements of upper-layer protocols:

1. The pseudoheader checksum must accommodate
larger addresses.

2. The maximum packet lifetime is no longer
computed.

3. The larger IPv6 header(s) must be taken into
account when computing the maximum payload
size (e.g., TCP's maximum segment size [MSS]).

In addition to these mandated modifications, we had
to make a fundamental design choice. With two different
network layer protocols in the system, each using a
different size address, we could have used two
independent transport modules, one for each network
layer, or a single integrated set of modules. Figures 5
and 6 show the independent versus the integrated
transport design options.

Although the independent model offers an element
of design simplicity, it wastes memory by duplicating
each transport layer function. In the Digital UNIX
implementation, these modules are implemented in
the kernel, and duplication would be expensive. Also,
the design and use of a single programming interface
to access both sets of services would be complicated.
The ability to maintain, let alone extend, the code base
would also suffer. Fortunately, because
IPv4 addresses are a well-defined subset of the entire
IPv6 address space, it is relatively straightforward to
implement the transports so that a single set of modules
can be used over both network layers.20 To accomplish
this, we increased the storage space allocated
for addresses and separated those functions that are
dependent upon a particular network layer. We discuss
each of these issues in this section.

Storing Large Addresses 

Two specific data structures must be modified to 
accommodate addresses larger than the 32-bit IPv4 
type. The first of these is the sockaddr struct, which is 
used when dealing with the BSD socket layer and 
passed along to user applications. The second is the 
Internet Protocol Control Block (PCB) data struc- 
ture, the in_pcb. In this section, we review the modifi- 
cations to each structure. 

A program that uses a transport does so by means of 
the BSD sockets interface and passes addressing infor- 
mation in a sockaddr structure. For IPv6, this is a 
sockaddr_in6. Internally, the structure is defined so 
that 64-bit alignment is preserved; however, it has the 
following public definition: 



    struct sockaddr_in6 {
        u_char          sin6_len;
        u_char          sin6_family;
        u_short         sin6_port;
        u_int           sin6_flowlabel;
        struct in6_addr sin6_addr;
    };



Although the concept of a sockaddr is generic in the
BSD architecture, the flow label and in6_addr members
of this structure are unique to IPv6 and would be
used only in the AF_INET6 address family. The details
of this are specified in Reference 21.

The in_pcb data structure is created for each socket 
using TCP, UDP, or other clients of the network layer. 
In addition to storing the source and destination 
addresses, various other pieces of information required 
for proper communication are stored here, including 
the port numbers, options and flags, a pointer to the 
socket receiving the data, a header template, and a 
pointer to the routing entry for the given destination. 
For IPv6, this basic model has been retained, and addi- 
tional information is stored. This information includes 
local and remote flow labels and indicators of which 
address family the application is using and which net- 
work layer the transport communication is using. 
Finally, a partial checksum of the transport pseudo- 
header is stored here as well; its use is described in the 
following section. 

In addition to the explicit storage of the network 
layer and address family, the fundamental technique 
that facilitates the use of a common transport is the 
storage of IPv4 addresses in an IPv6 format. This is 
known as an IPv4-mapped address and is described
in "IP Version 6 Addressing Architecture."20 This
address format is explicitly reserved to store addresses 
of systems that are capable of using only the IPv4 
protocol, and thus is an appropriate form of storage 
in the PCB for communications that will be sent using 
the IPv4 protocol, as opposed to IPv4-compatible 
addresses, which are sent using IPv6 packets. These 
mapped addresses are of the following form: 

0000:0000:0000:0000:0000:FFFF:204.123.2.75

These addresses are manipulated within the IPv4
TCP and UDP protocols by means of macros that 
allow the IPv4 addresses to be inserted, extracted, 
or compared while in an IPv6 address structure 
(in6_addr). As an example, the code fragment in 
Figure 7 shows an address being extracted for use 
in evaluating a configurable IPv4 socket option. 
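
To make the insert, extract, and compare operations concrete, the following C sketch implements them as functions over a 16-octet address. It is an illustration only: struct addr128 is a stand-in for in6_addr, and the function names are hypothetical rather than the macros used in the prototype.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for in6_addr: 16 octets in network byte order. */
struct addr128 {
    uint8_t b[16];
};

/* Store an IPv4 address in IPv4-mapped form: 80 zero bits, 16 one
 * bits, then the 32-bit IPv4 address. */
void v4mapped_insert(struct addr128 *a, const uint8_t v4[4])
{
    assert(a != NULL && v4 != NULL);
    memset(a->b, 0, 10);
    a->b[10] = 0xFF;
    a->b[11] = 0xFF;
    memcpy(&a->b[12], v4, 4);
}

/* Nonzero if the address carries the IPv4-mapped prefix. */
int v4mapped_is(const struct addr128 *a)
{
    static const uint8_t prefix[12] =
        { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xFF, 0xFF };
    return memcmp(a->b, prefix, sizeof prefix) == 0;
}

/* Copy out the embedded IPv4 address. */
void v4mapped_extract(const struct addr128 *a, uint8_t v4[4])
{
    assert(v4mapped_is(a));
    memcpy(v4, &a->b[12], 4);
}

/* Compare the embedded IPv4 address with a plain IPv4 address. */
int v4mapped_equal(const struct addr128 *a, const uint8_t v4[4])
{
    return v4mapped_is(a) && memcmp(&a->b[12], v4, 4) == 0;
}
```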

Special Transport/Network Layer Interactions 

Within the integrated transport layers, the transport 
protocol is treated independently of the particular 
network layer being used, and network layer-specific 
functions are used to interface to either IPv4 or IPv6. 

There are two particular instances in which the 
transport layer has interactions with the IPv6 network 
layer over and above the exchange of data packets for 
input or output. These are the notification and update 
of path MTU, which is required in IPv6, and the 
potential to refresh the neighbor discovery cache 
based on forward progress; i.e., if the transport knows
that data is reaching its destination, it can validate the
current network layer path. We investigate each of
these issues in turn.

    /*
     * Test address for IPv4 characteristic
     */
    if (inp->inp_netlayer == AF_INET) {
        struct in_addr tmp;

        tmp.s_addr = IN6_EXTRACT_V4ADDR(inp->inp_faddr);
        if (!in_localaddr(tmp))

Figure 7
Code Fragment of an IPv4-mapped Address

Path MTU discovery, as previously described, is 
triggered by ICMP messages processed in the network 
layer, with learned information stored in the routing 
table. In the course of processing a PTB message, the 
transport layer is notified through its control input 
(ctlinput) path. This is required because the reception 
of such an ICMP message indicates that the packet in 
transit has been discarded, so the protocol may need
to take appropriate action. In the case of TCP, it is 
necessary to recompute the maximum segment size 
and retransmit the affected data. Although this is not 
required for UDP, which is a pure datagram service, 
this knowledge can be made available to the corre- 
sponding socket owner. 

The other interaction between an upper layer and
the IP layer occurs when the upper layer, specifically
the TCP transport, wishes to indicate that communication
with a destination host has made forward
progress, for the purpose of resetting the timer in the
neighbor discovery cache. This positive feedback
mechanism is described in the neighbor unreachability
detection portion of the "Neighbor Discovery for IP
Version 6" specification and prevents unnecessary
probing of the current path.10 When acknowledgments
to previously sent data have been received,
TCP updates the routing table entry by means of an
RTM_CONFIRM message. This call is handled by the
neighbor discovery module, which resets the internal
neighbor discovery state for appropriate route entries,
as described later in this section.

Source Address Selection 

Many applications do not specify a particular source
address to use when initiating communications
with a remote host but instead use the symbol
INADDR_ANY, which allows the transport to select
a source address (and corresponding interface) to use.
For most IPv4 systems, this is a trivial exercise if only
a single address on a single interface exists. However,
multiple addresses per interface will be a common
occurrence on IPv6 hosts, and so the process of
choosing a source address to use becomes important.
The source address selection is typically done when
the PCB is bound to the application's socket, but this
function is also available to users of raw sockets and to
other network-layer users such as ICMPv6.

The source address selection function takes as argu- 
ments a destination address and an optional interface 
pointer. The latter is used when known, but in the case 
of initiating a transport connection it is null. The 
destination address is first checked against the list of 
current prefixes that have been advertised on the 
host's links, which would indicate which interface to 
use. (It also indicates that the destination is a potential 
neighbor, but this knowledge is not used at this
point.) Next, the address is tested for multicast versus 
unicast, and then the scope (link-local, site-local, 
organization-local, and global) is evaluated. Finally, 
a local address of equivalent (or greater) scope than 
the destination with the longest prefix match is 
returned. The scope must be taken into consideration 
to ensure that the destination system will be able to 
successfully respond to the communication. The 
longest prefix match is an attempt to ensure a reason- 
able routing path between the two systems, which 
could involve a complex mix of service providers. 
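
The final step, choosing a local address of sufficient scope with the longest prefix match, might look like the following C sketch. It covers only that step (not the prefix-list check or the multicast test), and the types and function names are hypothetical. The scope ordering follows the list given above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scopes ordered from narrowest to widest, as listed in the text. */
enum addr_scope { SCOPE_LINK, SCOPE_SITE, SCOPE_ORG, SCOPE_GLOBAL };

/* Hypothetical record for one address configured on the host. */
struct local_addr {
    uint8_t         addr[16];
    enum addr_scope scope;
};

/* Number of leading bits that two 128-bit addresses share. */
int common_prefix_bits(const uint8_t a[16], const uint8_t b[16])
{
    int bits = 0;
    for (int i = 0; i < 16; i++) {
        uint8_t diff = (uint8_t)(a[i] ^ b[i]);
        if (diff == 0) {
            bits += 8;
            continue;
        }
        while ((diff & 0x80) == 0) { /* count matching high-order bits */
            bits++;
            diff = (uint8_t)(diff << 1);
        }
        break;
    }
    return bits;
}

/* Index of the candidate with equal or greater scope than the
 * destination and the longest prefix match, or -1 if none qualifies. */
int select_source(const struct local_addr *cand, size_t n,
                  const uint8_t dst[16], enum addr_scope dst_scope)
{
    int best = -1;
    int best_bits = -1;
    assert(cand != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (cand[i].scope < dst_scope)
            continue; /* destination could not reply to this address */
        int bits = common_prefix_bits(cand[i].addr, dst);
        if (bits > best_bits) {
            best_bits = bits;
            best = (int)i;
        }
    }
    return best;
}
```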

Checksum Optimization 

Although the IPv6 header itself does not contain a 
checksum, the TCP, UDP, and ICMPv6 protocols do 
require a 16-bit one's complement checksum calcu- 
lation to validate the integrity of transmitted and 
received data. Performing this checksum can be an 
expensive operation. While this prototype was being 
developed, some alternative mechanisms were investi- 
gated, such as varying the size of the sum variables and 
operand units and tight versus expanded loops. The 
selected algorithm offered the best performance for 
the Alpha processors, while retaining the ability to be 
used on 32-bit processors. 

At the point where the PCB is established for transport
communications, a partial checksum is calculated
for the IPv6 pseudoheader, which consists of the source
and destination addresses, the packet payload length,
and the next header value. This partial checksum, with
the exception of the payload length (which varies per
packet), is then stored in the PCB, to be passed along
with the pointer to user data within the memory buffer
to the checksum function. The initial checksum calcu- 
lations are done using 32-bit values in 64-bit registers, 
and later are collapsed to the final 16-bit sum. This is 
coded as one large C statement, adding the various 
pseudoheader components in piecemeal fashion. This 
allows the compiler to schedule the instructions for 
optimal performance. The final packet checksum can 
then be computed by adding the partial checksums for 
the pseudoheader with the checksum values for the 
data itself, plus the payload length. 
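
The arithmetic behind this optimization can be illustrated with a portable C sketch. It is not the tuned routine described above: it reads the data as big-endian 32-bit words, accumulates them in a 64-bit sum, and folds the result to 16 bits at the end. All function names are hypothetical, and the data length is assumed to be a multiple of four octets to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum big-endian 32-bit words into a 64-bit accumulator.
 * len must be a multiple of 4 in this sketch. */
uint64_t cksum_sum32(const uint8_t *p, size_t len)
{
    uint64_t sum = 0;
    assert(len % 4u == 0);
    for (size_t i = 0; i < len; i += 4) {
        sum += ((uint32_t)p[i] << 24) | ((uint32_t)p[i + 1] << 16) |
               ((uint32_t)p[i + 2] << 8) | (uint32_t)p[i + 3];
    }
    return sum;
}

/* Collapse a wide one's complement sum to 16 bits by repeatedly
 * adding the carries back in. */
uint16_t cksum_fold16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Per-connection part of the pseudoheader sum: both addresses and the
 * next header value. The payload length is added per packet. */
uint64_t cksum_pseudo_partial(const uint8_t src[16], const uint8_t dst[16],
                              uint8_t next_header)
{
    return cksum_sum32(src, 16) + cksum_sum32(dst, 16) + next_header;
}

/* Final checksum: stored partial sum, plus payload length, plus the
 * sum over the data, folded and complemented. */
uint16_t cksum_final(uint64_t partial, uint32_t payload_len, uint64_t data_sum)
{
    return (uint16_t)~cksum_fold16(partial + payload_len + data_sum);
}
```

Summing 32-bit words and folding afterward yields the same 16-bit one's complement sum as adding 16-bit words directly, which is what allows the wider operands.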

Neighbor Discovery Overview 

The Neighbor Discovery specification describes several
important aspects of an IPv6 node's behavior in
relation to other IPv6 nodes connected to a common
link. IPv6 nodes on the same link use neighbor discovery
to discover each other's presence, to determine
each other's link-layer addresses, to find routers, and
to maintain reachability information about the paths
to active neighbors and remote destinations. These
functions are performed with several ICMPv6 messages
and options, as shown in Figure 8. The same
messages are also used for address autoconfiguration
and duplicate address detection, as described in "IPv6
Stateless Address Autoconfiguration."7

Interface Initialization 

When an interface is initialized for use with IPv6, a 
link-local address may be created without any external 
configuration, allowing the system to begin transmit- 
ting and receiving packets to nodes sharing a common 
link. This is performed by appending an interface 
token to the predefined link-local address prefix, 
FE80::. The length and content of the interface token
are link specific. For example, the address token for an
Ethernet interface is the interface's built-in 48-bit
IEEE 802 address, resulting in a link-local address
such as FE80::0800:2BBE:6260.
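
The construction of the example address can be sketched in C as follows. The function name is hypothetical, and the sketch follows the token format described here (the 48-bit IEEE 802 address placed in the low-order bits under the FE80:: prefix).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build a link-local address by appending a 48-bit IEEE 802 address
 * (the interface token) to the FE80:: prefix. */
void linklocal_from_ether(const uint8_t mac[6], uint8_t out[16])
{
    assert(mac != NULL && out != NULL);
    memset(out, 0, 16);
    out[0] = 0xFE;            /* link-local prefix FE80:: */
    out[1] = 0x80;
    memcpy(&out[10], mac, 6); /* token occupies the low-order 48 bits */
}
```

Given the Ethernet address 08-00-2B-BE-62-60, the function yields the address FE80::0800:2BBE:6260 used in the example above.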

Duplicate Address Detection 

Before a unicast address can be assigned to an inter- 
face, a process known as duplicate address detection 
(DAD) must be performed. 7 This process provides a 
degree of assurance that two nodes do not use the 
same address on the same link. The basic mechanism 
involves sending an ICMPv6 neighbor solicitation 
(NS), where the target address is the address being 
tested. If another node is using the address, it will 
respond with a neighbor advertisement (NA). Multicast
is used for both NS and NA packets, so DAD can
be performed for all unicast addresses, including the
first address assigned to the interface.

While an address is undergoing DAD, it is said 
to be a tentative address. It is not used to receive 
packets, nor is it used in outbound packets. The 
LA6_TENTATIVE flag in the in6_localaddr structure 
identifies addresses in this state. When a duplicate 
address is detected, the error is logged and the 
LA6_DADFAILED flag is set in the in6_localaddr
structure. If a duplicate address is not detected, the
LA6_TENTATIVE flag is cleared, making the address
available for use on the interface. 
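
The flag handling can be sketched in C as follows. The flag names come from the text, but their numeric values, the structure, and the function names are assumptions made for illustration. Whether the tentative flag remains set after a failure is not stated; the sketch leaves it set so that a failed address is never used.

```c
#include <assert.h>
#include <stddef.h>

/* Flag names from the text; the bit values are assumed. */
#define LA6_TENTATIVE 0x1u
#define LA6_DADFAILED 0x2u

/* Hypothetical stand-in for the in6_localaddr structure. */
struct local_addr_state {
    unsigned flags;
};

/* Begin duplicate address detection: mark the address tentative. */
void dad_start(struct local_addr_state *la)
{
    assert(la != NULL);
    la->flags |= LA6_TENTATIVE;
    la->flags &= ~LA6_DADFAILED;
}

/* Finish detection. On a duplicate, record the failure (the error
 * would also be logged); otherwise clear the tentative flag. */
void dad_finish(struct local_addr_state *la, int duplicate_seen)
{
    assert(la != NULL);
    if (duplicate_seen)
        la->flags |= LA6_DADFAILED;
    else
        la->flags &= ~LA6_TENTATIVE;
}

/* An address may send and receive only when it is neither tentative
 * nor failed. */
int addr_usable(const struct local_addr_state *la)
{
    assert(la != NULL);
    return (la->flags & (LA6_TENTATIVE | LA6_DADFAILED)) == 0;
}
```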

Address Resolution 

In IPv6, the function of mapping unicast IPv6
addresses into link-layer addresses is performed using
ICMPv6 messages. This is a departure from IPv4,
which relied on separate protocols (e.g., Address
Resolution Protocol [ARP]) to perform this function.25
IPv6 unicast address resolution is defined in
a generic manner and can be run over any link layer
that provides a link-layer multicast service (this
includes point-to-point and broadcast links, special
cases of multicast). This protocol may also be used for
nonmulticast-capable media (e.g., nonbroadcast multiple
access [NBMA] media such as ATM), provided
that the link layer provides the necessary services. The
function of mapping multicast IPv6 addresses into
link-layer addresses is specific to each link type.

The unicast address resolution function uses two
ICMPv6 message types: the NS and the NA. When a
node needs to resolve the unicast IPv6 address of
a neighbor to a link-layer address, it builds an NS
containing the IPv6 address to be resolved (target)
and sends it to the solicited-node multicast address
corresponding to the target address. As an optimization,
the node includes its own link-layer address as
an option in the NS message.

When an address is assigned to an interface, a node 
is required to join the solicited-node multicast group 
corresponding to that address, so a node will receive 
NSs sent to its solicited-node multicast address. Upon 
receipt of an NS, the target node builds an NA con- 
taining its link-layer address. The NA also contains the 
IPv6 target address, so that the soliciting node can 
associate the response with the corresponding request. 
The NA is then sent back to the soliciting node. 

Upon receipt of an NA, the soliciting node can map 
the target IPv6 address to the corresponding link-layer 
address and send any packets that were queued awaiting 
address resolution. Once a node has resolved an IPv6 
address, the link-layer address is cached until it must 
be replaced or deleted. A different link-layer address 
may be received in a subsequent NA packet, with the 
O-bit (override flag) set to indicate a new value. If 
the neighbor unreachability detection algorithm
(explained in the next section) detects that the cached
value is not reachable, the mapping will be deleted.

The address resolution process has several implica- 
tions for the implementation. Outbound packets must 
be queued pending link-layer address resolution, and 
link-layer addresses must be stored somewhere. The 
"Neighbor Discovery for IP Version 6" specification 
describes a conceptual neighbor cache to hold this 
information.10 The Digital UNIX IPv6 prototype uses
several data structures to implement the neighbor
cache. An nd6_llinfo structure keeps track of each
entry in the neighbor cache. This structure contains 
the queue header for packets awaiting link-layer- 
address resolution. The link-layer address is stored in 
the routing table, in a host route entry for the destina- 
tion IPv6 address. The RTF_LLINFO flag in the route 
entry indicates the presence of link-layer information. 
Each nd6_llinfo structure contains a pointer to the 
corresponding routing table entry, and the routing 
table entry points back to the nd6_llinfo structure. 

The use of routing table entries to hold the link-layer
address information is an optimization. A routing
table entry is associated with the majority of
packets transmitted for reasons other than address
resolution. Storing the link-layer address in the routing
table entry avoids the overhead of a separate link-layer
address table. This approach is modeled after the BSD
4.4 system's ARP implementation.

Neighbor Unreachability Detection 

Neighbor unreachability detection (NUD) has its
roots in the dead gateway detection in IPv4 but has
been generalized in IPv6 to include all neighboring
nodes (not just gateways).34 Unlike IPv4, the mechanisms
supporting NUD are an integral part of IPv6.
IPv6 nodes monitor the reachability of neighboring
nodes to which packets are being sent. An IPv6 node
relies on reachability confirmations to determine the
reachability state of a neighbor. In the absence of any
reachability indications, an IPv6 node will periodically
use an NS to actively probe the reachability of a neighbor.
An NA sent in response to an NS provides reachability
confirmation. The S (solicited) flag in the NA
is provided specifically for this purpose. If neither
method succeeds within a given period of time, a
neighbor is considered unreachable. Figure 9 shows
the neighbor unreachability states.

A reachability confirmation may take several differ- 
ent forms. Any packet received from a neighbor can be 
viewed as a reachability confirmation, provided that 
the packet would only have been sent by the neighbor 
in response to a packet sent from the local node. 
A TCP acknowledgment is one example: receipt of 
a TCP ACK indicates that a packet sent to the neigh- 
bor did in fact reach it. Another example is an ICMPv6 
redirect message. Receipt of a redirect message indi- 
cates that the neighboring router received a packet 
from the local node. 

In the Digital UNIX IPv6 prototype, the nd6_llinfo
structure holds NUD state and retransmit count information.
A field in the routing table entry is used for
NUD timers. The RTF_LLVALID flag in the route
entry is used to indicate that the neighbor is reachable.
A new routing message type (RTM_CONFIRM)
was defined to pass reachability confirmations to the
neighbor cache. This mechanism is used by TCP upon
receipt of new acknowledgments.
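
A deliberately simplified sketch of this state handling is shown below. It is not the state machine of Figure 9: the three states, the probe limit of three, and all names are assumptions chosen to illustrate how a confirmation (such as one delivered by RTM_CONFIRM) and a timer expiry could drive the state and retransmit count.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified states; the specification defines a richer set. */
enum nud_state { NUD_REACHABLE, NUD_PROBE, NUD_UNREACHABLE };

#define NUD_MAX_PROBES 3 /* assumed limit on NS retransmissions */

/* Hypothetical subset of what an nd6_llinfo entry might hold. */
struct nud_entry {
    enum nud_state state;
    int            probes_sent;
};

/* A reachability confirmation arrived, e.g., a solicited NA or an
 * upper-layer hint that forward progress was made. */
void nud_confirm(struct nud_entry *e)
{
    assert(e != NULL);
    e->state = NUD_REACHABLE;
    e->probes_sent = 0;
}

/* The NUD timer expired without a confirmation. Returns 1 if an NS
 * probe should be sent now, 0 otherwise. */
int nud_timer(struct nud_entry *e)
{
    assert(e != NULL);
    switch (e->state) {
    case NUD_REACHABLE:
        e->state = NUD_PROBE;
        e->probes_sent = 1;
        return 1;
    case NUD_PROBE:
        if (e->probes_sent >= NUD_MAX_PROBES) {
            e->state = NUD_UNREACHABLE;
            return 0;
        }
        e->probes_sent++;
        return 1;
    default:
        return 0;
    }
}
```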

Autoconfiguration 

One of the goals of IPv6 is to work properly in a
dynamic network environment without the need for
manual intervention on each system attached to the
network. The solution is to allow important pieces of
information to be learned and the system to autoconfigure
itself using this data. IPv6 autoconfiguration
encompasses the following items:

■ Router discovery 

■ On-link prefix discovery 

■ Interface attribute configuration 

■ Stateless address configuration 

■ Stateful address configuration

The mechanism for delivering this information to 
the hosts is the router advertisement (RA) packet of 
the Neighbor Discovery Protocol. In the following 
sections, we describe the methods we developed to 
process these packets and update the system. 

Host Autoconfiguration Daemon

To process these RAs, we designed a host daemon
called nd6hostd, which resides in the application space
of the Digital UNIX operating system. We determined
that a user-mode daemon was the most efficient way
to implement IPv6 autoconfiguration for the following
reasons:

■ A user-mode daemon would avoid kernel bloat. 

■ Maintenance and extensibility would be easier. 

■ The function is not performance critical. 

The autoconfiguration processing is implemented
as a single executable image, as a cohesive set of tightly 
coupled modules. The daemon currently is designed 
as a single-threaded application that uses a dispatch 
mechanism to call each specialized function module in 
turn. We will examine the idea of having this daemon 
run as a multithreaded application in the future. 

The nd6hostd daemon communicates with the
network subsystem in the kernel through multiple
techniques. Figure 10 shows the autoconfiguration
processing modules. The raw socket interface is used to
receive RAs, and I/O control messages (ioctls) are used
to manipulate kernel data structures. Also, the routing
table is updated as necessary, by means of a raw socket
interface to the PF_ROUTE protocol family.

We designed the IPv6 raw socket interface with
the ability to pass only specific ICMPv6 messages to 
a user and to filter extraneous packets or protocols. 
The nd6hostd daemon sets a socket option to receive 
only neighbor discovery RAs. It then executes a dis- 
patch routine that polls the raw socket, awaiting 
packets. When data is available on the socket, the dae- 
mon determines the characteristics of the message, 
creates a data structure to contain it, and calls the necessary
functions to perform autoconfiguration. The
dispatch module, in addition to polling socket descrip- 
tors, supports necessary timer management functions 
such as creation, deletion, and expiration. Figure 11 
shows the application daemon design center. 

Kernel Interface Data Structures 

In many ways, the data link interface is the focus of
IPv6 autoconfiguration support. The kernel data structures
for IPv4 interfaces are not sufficient to implement
the necessary IPv6 functions. We designed and imple- 
mented new interface data structures that encapsulated 
the existing IPv4 structures. This allowed us to avoid a 
recompilation of the existing data link drivers on the 
Digital UNIX operating system. In the future, we will 
attempt a design in which the interface structures for 
IPv4 and IPv6 are completely integrated. 

As shown in Figure 12, we designed an in6_ifnet
structure to support each data link type (e.g.,
Ethernet, PPP, loopback) and used the existing
ifnet structures to point to those link interfaces. The
in6_ifnet has its own in6_ifaddr structure for each
IPv6 address configured in the data structure
in6_localaddr. We also defined the in6_router structure
to support each router available for the implementation.
The in6_router structure specifies the
interface of the router, neighbor cache route, and
the IPv6 address of the router.

Interface Attribute Autoconfiguration

To autoconfigure the interfaces for IPv6, we created 
new ioctl functions to create, delete, update, and 
access the interfaces. In addition to their use by the 
nd6hostd daemon, these ioctls may be used by any 
future modules that need to access or manipulate the 
interfaces. This might include specialized configura- 
tion utilities, Simple Network Management Protocol 
(SNMP) management functions, security tools, or 
other services.

The interface module to update and maintain interface
structures for nd6hostd serves two purposes: to
update data link attributes provided by the RA, and to
maintain the data structures as a set of linked lists for
router discovery, on-link prefixes, and address configuration.
Figure 13 shows the interface attribute updates.

Router Discovery 

An RA packet has mandatory and optional parts. 
Before a default router is added to the routing table, 
the following interface attributes must be determined: 

1. Receiving interface

2. Current hop limit 

3. Reachable and retransmit times for use in NUD 

The link-local address from the source link-layer
option of the RA is then added to the routing table,
and the kernel data structures for router information
are updated. The router lifetime field in the RA defines
how long this router may be used as a default router.

The nd6hostd daemon first updates the interface 
attributes. A timer is set using the appropriate routine 
from the dispatch module. When the timer expires, 
the delete default router routine is called, and the 
router is deleted from the routing table. The daemon 
must also be able to delete the router if it receives an 
RA with a zero lifetime value, which can occur when a 
node is acting as a router but is reset to be a host. 
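
The lifetime handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name DefaultRouterList and its methods are hypothetical and are not part of the prototype, and a plain dictionary stands in for the kernel routing table and the dispatch module's timers.

```python
import time


class DefaultRouterList:
    """Hypothetical stand-in for the default-router entries in the routing table."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiry = {}  # router link-local address -> absolute expiry time

    def process_ra(self, router_addr, router_lifetime):
        """Add or refresh a router; a zero lifetime deletes it immediately."""
        if router_lifetime == 0:
            self._expiry.pop(router_addr, None)
        else:
            self._expiry[router_addr] = self._clock() + router_lifetime

    def expire(self):
        """Plays the role of the timer callback: drop routers whose lifetime ran out."""
        now = self._clock()
        for addr in [a for a, t in self._expiry.items() if t <= now]:
            del self._expiry[addr]

    def routers(self):
        """Return the currently usable default routers, sorted for stable output."""
        return sorted(self._expiry)
```

A caller would invoke process_ra for every RA received and expire whenever a timer fires; injecting the clock makes the expiry behavior easy to exercise without waiting.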

On-link Prefixes 

An on-link prefix in IPv6 defines a subnet and is typically
configured on a router for a specific link by the
network administrator. The router then advertises this 
prefix to all nodes connected to that link as a prefix 
option, appended to an RA. A prefix option defines a 
single prefix only, but an RA may contain more than 
one such option. As shown in Figure 8, the prefix 
option provides the following information: 

■ Prefix length 

■ Link- or L-bit, which is set if the prefix is directly
reachable on link (i.e., a neighbor)

■ Autonomous- or A-bit, which is set if the prefix can
be used for stateless address configuration

■ The length of time the prefix is valid 

The daemon adds the prefix to the routing table.
Then a timer routine is called from the dispatch module
and is set for the time the prefix is valid. When the
dispatch routine calls the delete on-link prefix module,
the prefix is deleted from the routing table. A prefix
can also be deleted when a new RA presents the prefix
with a lifetime of zero. In that case, the on-link
prefix module will stop the timer routine and delete
the prefix from the routing table.
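
As an illustration of the fields listed above, the following Python sketch decodes a prefix option. The byte layout (option type 3, length 4, flag masks 0x80 and 0x40, and a preferred lifetime field following the valid lifetime) is an assumption taken from the published Neighbor Discovery specification rather than from Figure 8, and the function name parse_prefix_option is hypothetical.

```python
import ipaddress
import struct

# Assumed layout: type, length (in 8-byte units), prefix length, flags,
# valid lifetime, preferred lifetime, reserved, 128-bit prefix.
_LAYOUT = struct.Struct("!BBBBIII16s")
PREFIX_INFO_TYPE = 3
L_BIT = 0x80  # prefix is on-link
A_BIT = 0x40  # prefix may be used for stateless address configuration


def parse_prefix_option(data):
    """Decode one prefix option from the raw bytes of an RA."""
    if len(data) < _LAYOUT.size:
        raise ValueError("prefix option is truncated")
    (opt_type, opt_len, prefix_len, flags,
     valid, preferred, _reserved, raw_prefix) = _LAYOUT.unpack(data[:_LAYOUT.size])
    if opt_type != PREFIX_INFO_TYPE or opt_len != 4:
        raise ValueError("not a prefix information option")
    return {
        "prefix": ipaddress.IPv6Address(raw_prefix),
        "prefix_length": prefix_len,
        "on_link": bool(flags & L_BIT),
        "autonomous": bool(flags & A_BIT),
        "valid_lifetime": valid,
        "preferred_lifetime": preferred,
    }
```

Because an RA may carry several such options, a caller would apply this function once per option and handle each prefix independently.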



Address Configuration 

Address configuration is one of the new paradigms 
that must be supported in IPv6. Two configuration 
methods, stateless and stateful, are provided to auto- 
configure addresses for a host. The M-bit flag in an RA 
message determines which method to use and informs 
a host. In addition, the other-bit (O-bit) flag is pro- 
vided to configure other network parameters required 
for the host's operation on the network when the 
stateful configuration is used. 
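
The role of the two flags can be made concrete with a small Python sketch. The bit positions (0x80 for the M-bit, 0x40 for the O-bit) are assumptions taken from the published Neighbor Discovery specification, and the function name configuration_plan is hypothetical.

```python
M_BIT = 0x80  # assumed position of the managed (M) flag in the RA flags byte
O_BIT = 0x40  # assumed position of the other (O) flag


def configuration_plan(ra_flags):
    """Report which configuration method the RA flags ask the host to use."""
    managed = bool(ra_flags & M_BIT)
    other = bool(ra_flags & O_BIT)
    return {
        "address_method": "stateful" if managed else "stateless",
        "other_parameters_stateful": other,
    }
```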

Address autoconfiguration in IPv6 supports the 
ability to dynamically renumber a link or a complete 
network through the use of lifetimes specified in the 
RA message. The valid lifetime is the time the address 
has before expiration. When the timer expires, all
connections using that address are dropped by the
implementation, and no new connections are permitted.
The preferred lifetime is provided to inform an imple- 
mentation that an address is about to expire; it typi- 
cally is set to a lower value than the valid lifetime. 
When this timer expires, the address is said to enter the 
deprecated state, at which point an implementation is 
permitted (as a configuration option) to prevent new 
communications using this address as a source or des- 
tination. This model is designed to provide network 
administrators with control over the use of network
addresses without manual intervention of each host on
the network. The stateless model is intended for users
who do not need tight control over address
configuration; stateful mechanisms will be used where the
administrators want to delegate and control addresses
based on a client/server method. Figure 14 shows the address
autoconfiguration diagram.
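
The two lifetimes define a simple state progression for each address, which the following Python sketch illustrates. The function name address_state and the state label "preferred" for an address that has not yet been deprecated are assumptions made for this example; all times are in seconds since the address was configured.

```python
def address_state(age, preferred_lifetime, valid_lifetime):
    """Classify an address by its age relative to its two lifetimes."""
    if preferred_lifetime > valid_lifetime:
        raise ValueError("preferred lifetime must not exceed valid lifetime")
    if age >= valid_lifetime:
        return "invalid"      # connections dropped, no new ones permitted
    if age >= preferred_lifetime:
        return "deprecated"   # new communications may be prevented
    return "preferred"        # address can be used freely
```

Renumbering a link then amounts to advertising the old prefix with short lifetimes and a new prefix with long ones, so that addresses drift from preferred to deprecated to invalid without manual intervention.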

When the daemon receives an RA and the A-bit is
set, the daemon can use the prefixes provided to
perform stateless address configuration. The daemon uses
the on-link prefix(es) provided in the RA to configure
addresses for an interface. Addresses are created,
deleted, or updated on the interface based on the
prefixes and lifetimes received in the RA packet.
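
A minimal Python sketch of the address-forming step follows. It assumes that the host combines an advertised prefix with an interface token derived from its link-layer address; the function name stateless_address and the example token are hypothetical, and the prototype's actual token derivation is not shown here.

```python
import ipaddress


def stateless_address(prefix, interface_token):
    """Combine an advertised prefix with a host-supplied interface token."""
    net = ipaddress.IPv6Network(prefix)
    host_bits = 128 - net.prefixlen
    if interface_token < 0 or interface_token >= (1 << host_bits):
        raise ValueError("interface token does not fit beside the prefix")
    # The prefix occupies the high-order bits; the token fills the rest.
    return ipaddress.IPv6Address(int(net.network_address) | interface_token)
```

Running the function once per advertised prefix yields one address per prefix, each of which then inherits the lifetimes carried in the corresponding option.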

Interactions with Stateful Address Configuration 

When the daemon receives an RA, and the managed
bit flag is set, the host can use stateful address
configuration, using DHCPv6. DHCPv6 is implemented
as a separate daemon process in our prototype.
DHCPv6 defines a complete new model from the
existing DHCPv4 implementations in the industry to
dynamically configure addresses. The use of link-local
addresses, multicast, address configuration, and inherent
support for dynamic renumbering of hosts in
IPv6 caused a new architecture and design in the
DHCPv6 specification. A comparison of the architectural
changes between DHCPv4 and DHCPv6 can be
found in the DHCPv6 specification.

Application Services 

Most TCP and UDP applications can be used with
IPv6 with relatively minor modifications. The primary
issue is the larger address size, both for internal storage 
needs in the application and for address transfer across 
system interfaces. In this section, we review these 
issues and others. 

API 

Any API currently in use for IPv4 could be modified
for IPv6, but only the BSD sockets API is being
investigated within the IETF for two reasons. First, large
numbers of applications use the sockets interface for
IPv4, which represents a very large investment and a
potential pool of IPv6 applications. Second, this API is
perhaps in the most widespread use in the industry and
is available on a wide variety of platforms: the benefits
of standardization are compelling.

DNS AAAA Support 

DNS provides support for mapping names to IP
addresses and mapping IP addresses back to their
corresponding names. The type A resource record is
used to hold an IPv4 address. Since its size is fixed
at 4 bytes, a new resource record type, AAAA, was
defined to hold IPv6 addresses. The Digital UNIX
IPv6 prototype includes a widely used implementation
of the DNS known as Berkeley Internet Name
Domain (BIND), which has been modified to support
AAAA records.

Address Manipulation Routines 

A typical IP implementation provides several library 
routines for manipulating IP addresses. These include 
routines for converting addresses between binary and 
textual representations and routines for translating 
names to addresses and addresses to names. New
routines had to be provided to perform these functions
for IPv6 addresses. The Digital UNIX IPv6 prototype
provides the routines described in "Basic Socket
Interface Extensions for IPv6."
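
The conversion between textual and binary forms can be illustrated with Python's socket module, which exposes inet_pton and inet_ntop, the conversion routines standardized with the sockets extensions. This is a sketch rather than the prototype's C code; the helper names to_binary and to_text are hypothetical, and choosing the address family by looking for a colon is a simplification.

```python
import socket


def to_binary(text):
    """Convert a textual address to its binary form (4 or 16 bytes)."""
    family = socket.AF_INET6 if ":" in text else socket.AF_INET
    return family, socket.inet_pton(family, text)


def to_text(family, packed):
    """Convert a binary address back to its textual representation."""
    return socket.inet_ntop(family, packed)
```

The 16-byte result for an IPv6 address, against 4 bytes for IPv4, is the size difference that made the AAAA record and the new routines necessary.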

inetd Daemon 

The inetd daemon creates sockets on behalf of
applications, invoking the applications only when needed and
passing the open sockets to them. With the advent
of the AF_INET6 socket type, inetd was modified
to accept a new application configuration option in
its configuration file. The keyword inet6 is used to
indicate an application that wants to use AF_INET6
sockets. The keyword inet (or the absence of a
keyword) indicates use of AF_INET sockets.
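
The keyword rule can be expressed as a short Python sketch. The function name socket_family_for is hypothetical, and the sketch deliberately does not model the rest of the configuration file syntax, which is not described here.

```python
import socket


def socket_family_for(keyword=None):
    """Map the optional configuration keyword to a socket address family."""
    if keyword is None or keyword == "inet":
        return socket.AF_INET      # default: IPv4 sockets
    if keyword == "inet6":
        return socket.AF_INET6     # application asked for IPv6 sockets
    raise ValueError("unknown address family keyword: %r" % (keyword,))
```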

Applications 

A typical application needs only minor modification to
use the AF_INET6 address family. Applications that
use addresses as part of their design or protocol, such
as the File Transfer Protocol (FTP), require more
extensive modification. The Digital UNIX IPv6
prototype includes several basic applications that have been
modified to support IPv6, including Telnet and FTP.
These programs were modified to use IPv6 sockets,
address structures, and library routines. Note that the
IPv6 sockets also support communications over IPv4,
so that applications need not maintain separate sockets
for IPv4 and IPv6, and a single executable image can
interoperate with both types of remote system.
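
One way a single IPv6 socket can serve both kinds of peer is to represent an IPv4 peer as an IPv4-mapped IPv6 address (::ffff:a.b.c.d). The Python sketch below assumes that representation, which comes from the sockets extensions rather than from the text above; the function name describe_peer is hypothetical.

```python
import ipaddress


def describe_peer(addr_text):
    """Tell whether a peer address seen on an IPv6 socket is really IPv4."""
    addr = ipaddress.IPv6Address(addr_text)
    mapped = addr.ipv4_mapped  # None unless the address is IPv4-mapped
    if mapped is not None:
        return ("IPv4", str(mapped))
    return ("IPv6", str(addr))
```

An application written this way keeps one code path for accepting connections and branches only where the distinction matters, for example when logging the peer.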

Future Work 

Future implementation efforts will include security,
routing, stateful address configuration, dynamic
updates to DNS, IPv6 over PPP and ATM, resource
reservation, and service location. In addition, we will
review elements of our existing design and
implementation architecture to increase performance and to ease
the transition from IPv4 to IPv6. We will continue to
participate in the IPv6 industry multivendor
interoperability events, which are a practical and concentrated
effort to debug the specifications and the code base.

IPv6 security supports both the authentication and
the encryption of IPv6 packets end-to-end. The module
for these functions will reside in the kernel and most
likely will be called at the point where the IPv6 network
layer packet is processed. A key management framework
is being developed to support both authentication
and encryption. To access the key management
interface, a sockets API extension will be provided to
supply the keying criteria for the security modules.

To test the interoperability and robustness of
the IPv6 implementations, a test network known as
the 6bone has been created on the Internet. This
nascent test bed is currently being built with statically
defined tunnels connecting IPv6 networks. Our next
step in IPv6 development will be to implement
routing protocols, starting with Routing Information
Protocol version 6 (RIPv6) for unicast routing.
Subsequent goals will be to support Open Shortest
Path First version 6 (OSPFv6) and to provide
multicast routing.



Stateful address configuration will be implemented
as specified in DHCPv6 and will contain a client, a
server, and a relay agent. This work will be tightly
coupled with dynamic updates to DNS to provide
autoconfiguration in conjunction with autoregistration in
the directory service. Even for networks that use
stateless address autoconfiguration, DHCPv6 will be
available to configure other parameters for the host and to
add, delete, and update name information associated
with addresses in DNS.

Additional data link interfaces will be supported for
PPP and ATM. These nonbroadcast architectures will
require some design analysis to implement in order to
support neighbor discovery, autoconfiguration, and
the routing models for IPv6. Digital has been active
within the IETF working groups that are defining the
ATM solutions.

IPv6 now supports flow information in the IPv6
header and in the IPv6 BSD socket API structure. This
inherent quality-of-service (QOS) mechanism in IPv6
meshes well with efforts to reserve resources
on a network as specified in the Resource Reservation
Protocol (RSVP). Using RSVP over broadcast and
nonbroadcast data links will encompass a design
center that supports a wide range of resource reservation
parameters to maintain a consistent performance
model for video- and audio-related applications across
a network path.

Service location is an emerging technology that will
permit a host to query the network about the location
of different services (e.g., NFS, security key
management, directory services). Currently in development
for IPv4, service location holds promise for IPv6 and
may benefit from the greater level of support for basic
technologies, such as security and multicast capabilities.

Summary 

Digital has designed a prototype of IPv6 on the Digital 
UNIX operating system. Techniques and technologies 
have been developed to accommodate aspects of the 
IPv6 architecture; in particular, the transport layer
modules were modified to use two distinct network-layer
protocols. The new Neighbor Discovery Protocol
and algorithms have also been implemented in the
prototype. IPv6 includes mechanisms to do both stateless
and stateful address configuration as well as router
discovery. The Digital UNIX IPv6 prototype contains
a user-mode process that implements these functions. 
In addition, enhancements have been made to IPv4 
services, and techniques have been developed to sup- 
port the transition of existing applications. 
Maxwell M. Burnet
Robert M. Supnik 

Preserving Computing's 
Past: Restoration and 
Simulation 



Restoration and simulation are two techniques 
for preserving computing systems of historical 
interest. In computer restoration, historical sys- 
tems are returned to working condition through 
repair of broken electrical and mechanical sub- 
systems, if necessary substituting current parts 
for the original ones. In computer simulation, 
historical systems are re-created as software 
programs on current computer systems. In each 
case, the operating environment of the original 
system is presented to a modern user for inspection
or analysis. This differs from computer conservation,
which preserves historical systems
in their current state, usually one of disrepair. 
The authors argue that an understanding of 
computing's past is vital to understanding its 
future, and thus that restoration, rather than 
just conservation, of historic systems is an 
important activity for computer technologists. 



The Computing Past 

The continuous improvements in computing technology
cause the rapid obsolescence of computer systems,
architectures, media, and devices. Since old computing
systems are rarely perceived to have any value, the
danger of losing portions of the computing record is 
significant. When a computing architecture becomes 
extinct, its software, data, and written and oral records 
often disappear with it. 

Older computer systems embody major investments
in software, the value of which may persist long after the
systems have lost their technical relevancy. For example,
the PDP-11 computer has not been a leading-edge
architecture since the introduction of 32-bit systems
in the late 1970s and has not received a new hardware
implementation since 1984. Nonetheless, PDP-11
systems continue to be used worldwide, particularly in
real-time and control applications. The unavailability
of suitable replacements of worn-out original parts is
a serious issue for PDP-11 systems still in use.

Another area of potential loss is data. In recent
years, archival storage media have undergone rapid
technologic evolution, and the industry standards of
computing's first 30 years, such as 0.5-inch magnetic
tape, are now antiques. Salvaging data from original
media is an industry-wide problem and has generated
a small cottage industry of specialists in data recovery.
This problem will only proliferate, as transitions in
media types accelerate. Ten years from now, the
large-diameter optical disks used for today's archives will
look as quaint as DECtape and magnetic tape storage
systems do to current computer users.

Finally, the disappearance of older equipment typically
entails loss of information: not only design
sketches, blueprints, and documentation but also the
folklore about these systems. The absence of systematic
archiving, as well as the absence of a perceived
value of the archived data, causes continual information
decay about design and operational details.

This paper describes two techniques for preserving
computing systems of historical interest. The first
section of the paper discusses the restoration of old
computers to working order. It also includes a description
of the Australian Museum collection and the
process of restoring a particular PDP-11 minicomputer.
The second section discusses the simulation
of old computers on modern systems. It describes a
simulation framework called SIM, which has been
used to implement simulators for the PDP-8, PDP-11,
PDP-4/7/9/15, and Nova minicomputers.

Restoring Old Computers 

Since the computer became a mass-produced item in
the late 1960s, its typical life cycle has consisted of initial
installation, rental or depreciation for about five years,
retention and use for a few more years (just in case), and
then retirement and a trip to the refuse dump. There is
only a brief window of opportunity to collect old
computers at the end of their working life. Once that
window is closed, the computers are gone forever.

The Australian Museum Collection 

In Sydney, Australia, this window of opportunity
first became apparent in 1971, when the early PDP
systems reached the ends of their life cycles. Digital's
Australian subsidiary began collecting systems by a
creative program of trade-ins for new equipment.1 It
was especially urgent to obtain examples of the 12-bit,
18-bit, and 36-bit PDP series, as they were relatively
few in number. Table 1 lists the percentage of available
units that have been collected. The status of each is
given as

■ Static — can never be made to work for various 
reasons 

■ Restorable — could be made to work with enough
care, patience, time, and effort

■ Working — running its operating system the last
time it was turned on



Once a representative sample of the early PDP 
systems had been collected, the urgency abated. 
Hundreds of PDP-11 and VAX systems were then 
brought to Australia; the window of opportunity for 
collecting them is still open. 

The collection has grown significantly during the 
last 25 years. At the present time, we have in Sydney 
a comprehensive collection of most early Digital 
machines, including hardware, manuals, software, and 
spares (see Table 2). The collection is catalogued in 
a 6,000-line database that resides, appropriately, on a 
MicroVAX I computer, running the first version of
the MicroVMS operating system. Figure 1 shows an
example from the collection, a PDP-8/E computer
system with peripheral equipment. 

The goals of the collection are varied and are sum- 
marized in Table 3. Apart from the academic challenge 
of keeping all old data media running, there is the 
responsibility to ensure that they can be kept alive and 
available. The extensive variety of media types offered 
by Digital alone in only 30 years is summarized in 
Table 4. The evolving status of the collection has been 
reported at several Australian DECUS Symposia.
The restoration of the Australian collection will probably
ensure a retirement job for the curator for the
next 30 years!

General Issues in Restoration 

Restoration is a painstaking and time-consuming 
process. The goal of restoration is to return a system to 
a state where it will reliably run a major operating sys- 
tem and offer as many media conversion facilities of 
the vintage as possible. Fortunately, computers do not
deteriorate greatly in storage, provided the storage 
area is dry. (One item that does decay dramatically is 
the black foam used to line side panels and to separate 



Table 1
Early Digital CPUs in Australia

Model       Number Brought    Number in
Name        to Australia      Museum Collection    Condition
PDP-5        1                1                    Restorable
PDP-6        1                1                    Some items
PDP-7        1                1                    Static
PDP-8       28                3                    Working
PDP-8/S     20                2                    Static
LINC-8       2                2                    Restorable
PDP-9        7                1                    Restorable
PDP-10       8                1                    Some items
PDP-12       2                2                    Restorable
PDP-8/I     24                2                    Restorable
PDP-8/L     21                2                    Restorable
PDP-15      10                1                    Static
PDP-8/E     90                4                    Working
Table 2
The Digital Australian Collection (chronological order)

Year         Item          Description                                     Status
(illegible)  (illegible)   A/D converter                                   Static
1960         ASR33         Teletype reader/punch, 110 baud                 Working
(illegible)  (illegible)   Heavy-duty Teletype                             Working
1963         PDP-6         Modules of first Digital computer in Australia  Parts
1963         PDP-5         First minicomputer in Australia                 Working
(illegible)  PDP-7         Third Digital computer in Australia             Static
1965         PDP-8         Classic, table-top model                        Working
1965         PDP-8         Cabinet model                                   Restorable
1965         PDP-8         Typesetting system                              Static
1965         PDP-8         Cabinet model, first in New Zealand             Restorable
1965         (illegible)   Remote batch, PDP-8 (partly illegible)          Restorable
1966         PDP-9         18-bit computer                                 Static
1966         KA10          Console of PDP-10 mainframe                     Static
1966         LINC-8        Early medical computer                          Working
1967         PDP-8/S       Serial (remainder illegible)                    Static
(illegible)  PDP-8/S       Serial computer                                 Static
1967         (illegible)   Digital's first disk (capacity illegible)       Static
1967         PDP-9/L       Last transistor logic, 18-bit                   Static
1968         PDP-8/I       Digital's first IC minicomputer                 Working
1968         PDP-8/L       OEM version of PDP-8/I                          Static
1969         PDP-12        Laboratory computer                             Working
1969         PDP-12        Laboratory computer                             Static
1969         PDP-15        Last of 18-bit family                           Static
1969         KI10          Console of DECsystem-10                         Static
1970         PDP-8/E       Pinnacle of PDP-8 development                   Working
1970         PDP-8/E       Full LAB 8 configuration                        Working
1970         PDP-11/20     The first PDP-11                                Working
1970         CR11          Card reader (speed illegible)                   Working
1971         PDP-8/F       Small PDP-8/E                                   Working
1971         VT05          Digital's first video terminal                  Working
1971         LA30P         Digital's first hard-copy terminal              Working
1971         PDP-11/45     (illegible) PDP-11                              Static
1972         GT40          Graphics workstation                            Broken
1972         PDP-11/10     Small PDP-11                                    Static
1973         PDP-11E10     First packaged system                           Working
1973         PDP-11/35     Mid-range PDP-11                                Static
1973         PDP-8/A       Last non-chip PDP-8                             Working
1974         PDP-11/40     Mid-range, end-user PDP-11                      Restorable
1975         VT50          Video terminal                                  Working
1975         LA36          DECwriter II printer                            Working
1975         DS310         Desk-based commercial system                    Working
1975         PDP-11/70     Largest PDP-11                                  Restorable
1976         PDP-11/34     Mid-range PDP-11                                Working
1977         PRS01         Portable paper tape reader                      Working
1977         LS120         DECwriter printer                               Working
1977         WS78          Word processor, 8-inch floppy disks             Working
1978         LA120         DECwriter III printer, 180 cps                  Working
1978         VAX-11/780    Original unit of 1 VAX-11/780                   Restorable
Table 2 (continued)

Year         Item          Description                                     Status
1979         VT100         Famous video terminal                           Working
1980         MINC          LSI-11 lab unit with RT-11                      Working
1980         VAX-11/750    Mid-range VAX system                            Restorable
1980         PDT-150       Table-top LSI-11 with RX01 drives               Working
1981         GIGI          Low-cost terminal for schools                   Working
1982         VT125         Video terminal with graphics                    Working
1982         WS278         DECmate I word processor                        Restorable
1982         VAX-11/730    Low-performance VAX system                      Working
1982         LA12          Portable hard-copy terminal                     Static
1982         LQP03         Letter-quality printer                          Working
1982         DECmate II    Word processor on mobile stand                  Working
1982         DECmate II    Word processor                                  Working
1982         Rainbow       Personal computer                               Working
1982         PRO350        Professional PC                                 Working
1983         VT241         Graphics color terminal                         Working
1983         MicroVAX I    Smallest VAX, 0.3 VUP                           Working
1983         VAX-11/725    Lowest cabinet VAX, 0.3 VUP                     Working
1984         LN03          Laser printer                                   Working
1985         MicroVAX II   Famous MicroVAX II                              Working
1986         VAXmate       286-based PC with RX33 drive                    Working
1986         DECmate III   Small word processor                            Working
1987         MicroVAX III  3-VUP MicroVAX II system                        Working
1987         VAX 8250      Dual VAX CPU, BI-based                          Restorable
1989         VAX 9000      Chip set                                        Static
1990         DS3100        Mips UNIX workstation                           Restorable



ribbon cables. After 20 years, it turns into a sticky,
gooey mess. It should be removed as soon as possible;
otherwise, it falls into the modules and backplane.
Replacing it with a modern equivalent can be done but
is not essential.)

The first step in restoration is to collect hardware, 
software, and documentation. 

■ Collect the hardware, if possible two or ideally 
three items of each example. This provides a system 
to work on and a spare, as well as the ability to make 
comparisons between units. 

■ Collect diagnostic and operating software on original
bootstrap media. Sources are very useful, particularly
for diagnostics.

■ Collect hardware manuals and schematics. 

There is a network of enthusiasts around the world 
who can help at this stage. 

Once the "ingredients" have been collected, the 
steps needed to restore a 1960s or 1970s vintage 
machine are as follows: 

■ Inspect the hardware for physical safety, particularly
the heavy drawers and slide mechanisms.



■ Physically assemble the hardware, checking module
allocations, cabling, etc.

■ Carefully inspect the power system; high-voltage
sources can kill. Although most of the power wiring
material appears to stand the test of time, the early
machines often had rather thin coverings on terminals.
Safety-first is a principal criterion in restoration,
since someday nontechnical people may open
the back door.

■ Assemble a minimal system of CPU, memory, and 
console switch register for initial tests. 

■ Power up the computer, checking supply voltages,
fans, and front console for signs of life.

■ Use simple routines at the switch register to check 
for elementary operation. 

■ Fit a serial line unit so that a VT or a Teletype con- 
sole can be used. 

■ Get the keyboard echoing to the screen or printer 
with simple routines. 

■ If they are available, run the internal tests of the
read-only memory (ROM).
Figure 1
PDP-8/E Computer System
(Callouts: RX01 dual 8-inch floppy diskettes; TD8E/TU56
accumulator transfer dual DECtape system; PC8E 300-cps
reader, 50-cps paper tape punch; PDP-8/E CPU with extended
arithmetic element, 16K words memory, KL8E 2400-baud
console, KL8E 2400-baud communication port, DECtape
bootstrap, RK05 disk bootstrap, and real-time clock; RK05
removable 2.4-MB cartridge disk; storage rack for 10
DECtape systems; H861 power distribution)



Conventional wisdom would now advise that all the
diagnostic routines be run. However, diagnostics were
(philosophically) always used to find bugs in a previously
good machine; they are too complex when huge
chunks of the machine might still be missing. The
most practical next step is to get mass storage on-line.
Depending on the manufacturer, the target device
may be a floppy disk drive, a cartridge hard disk drive,
or some form of magnetic tape. With a working mass
storage device and a bootstrap routine, it becomes
possible to boot a simple operating system (like OS/8
or RT-11 for Digital's systems). This quickly shows
whether the machine is working or not.

If a mass storage device is not available, the next best
thing is paper tape. This can be either the system's
rack-mounted reader and punch or the paper tape
reader on an ASR33 or ASR35 console. The reliability
is questionable, however, and the procedure is
tedious. Many diagnostics were on paper tape, but
usually the quickest test is to load a complete paper
system (such as FOCAL for Digital's systems). If the
diagnostics run, the system is probably functional.

Once the CPU, console, and memory are verified, 
additional peripherals can be added, one at a time. It 
pays to take the time and effort to research bus 
addresses, interrupt vectors, power supply loading, 
and module placement, and to keep a log book with 
configuration diagrams and results. In general, if the 
configuration rules are followed, the items will work. 
There are few electronic failures, even in 20- or 30- 
year-old modules. When a problem arises, it is usually 
address vector strapping, physical damage, or missing 
cables. Corrosion of board contacts can be a problem; 
they should be cleaned with a clean cloth or card board 
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Table 3 

Goals of the Australian Digital Museum 



Table 4 

Digital Data Media from 1 960 to 1 996 



To preserve one of each model of Digital's computers 

To keep each major Digital operating system working 

To have a working unit of each Digital terminal, con- 
sole, and PC 

To provide conversion and archival facilities for old 
media 

To preserve significant Digital literature and manuals 

To preserve a VAX-1 1/780 computer as the original 
unit of 1 VUP 

To disseminate instructive and educational material 

To educate and amuse our staff, our customers, and 
the public 

To support the DECUS NOP (nostalgic obsolete prod- 
uct) Special Interest Group 

To preserve spares, tools, test gear, and documenta- 
tion to keep the collection working 

To preserve and protect these treasures for future 
generations 



(for example, a business card), not w ith a pencil eraser, 
which leaves residues. Silicon components appear to 
be very stable and a tribute to the conservative design 
principles of early computer engineers. 

The main components that seem to age are power 
supply capacitors, fans, and lights. The filter capaci- 
tors across the high-voltage sources can short, and 
reference electrolytic capacitors in power supply regu- 
lators can dry out. Although the large capacitors in 
power supply RC filters have proven to be reliable, 
some restorers replace them as a matter of course lor 
safctv reasons. Small rotary fans may seize if they have 
logged many hours. Incandescent panel lamps are 
alwavs failing and can be replaced by modern light- 
emitting diodes (LFDs) if required. The irony is rhat 
the panel lamps are needed only during initial check- 
out; once the operating system is running, they are 
rarely used. 

Once restored, are old units reliable? Experience 
proves that they are. A classic PDP-8 system restored 
in 1988 still turns on happily (untouched) eight years 
later. A fully configured PDP-8/E system is still work- 
ing four years after restoration. 

Restoring a Minicomputer: A Case Study 

An ongoing project is the restoration of a large, 
UNIBUS-based PDP-11 system with many UNIBUS 
peripherals attached to it. The project was started 
using the original PDP-11/20 CPU. Since many 
PDP-11 peripherals were designed long after the 
PDP-1 1/20 CPU, it could not cope with single-board 
direct memory access (DMA) devices, metal-oxide 



Paper tape 

80-column punched and mark sense cards 

7-track, half-inch magnetic tape
9-track, half-inch magnetic tape
DECtape and LINCtape systems 
Audiocassette 

DECtape II cartridge (TU58) 
CompacTape (TK50, etc.) 
Quarter-inch cartridge tape 
Digital audio tape 

8-inch floppy disk
5.25-inch floppy disk 
3.5-inch floppy disk 
RK05 removable disk 
RK06, RK07 removable disk 
RL01, RL02 removable disk 
RP01 ...RP06 removable disk 
RM03, RM05 removable disk 
RC25 removable disk 



semiconductor (MOS) memory, and other later inventions.
The project refocused on the mid-range PDP-11/34, which
in retrospect has proved wise. The PDP-11/34 supports MOS
memory, has an LED and push-button console, and represents
a mature implementation of the PDP-11 instruction set. It
has an optional cache, battery backup, floating-point
operation, and the extended instruction set (EIS).

The current configuration occupies three large cab- 
inets in what used to be the dining room of Max 
Burnet's house. The virtues of the UNIBUS are many; 
in particular, it allows modular connection of I/O 
devices and other components. However, I/O devices
of the era often weigh 100 pounds and are mounted in 
10-inch drawers; their sheer physical size and weight 
are disincentives to reconfiguration. 

The project currently uses the RT-11 operating
system because of its simplicity and extensive device 
drivers. Eventually, it may be possible to run the 
RSX-11M and the RSTS/E systems, but there is little
to gain from a media conversion point of view, because 
RT-11 includes utilities for dealing with foreign file 
formats. 

The main difficulties encountered have been associ- 
ated with the power supply: the DC low signal threads 
its way through every peripheral. The absence of 
UNIBUS grant continuity cards can create havoc. 
Since this PDP-11 system is very large, it is straining
the design rules concerning floating vectors, current 
loading, and bus loads. 

The CPU and memory are relatively easy to check
out. Due to the versatility of the UNIBUS, however,
checking out the I/O system is very laborious.
Starting with programmed I/O tests works best, followed
by interrupt tests, and finally DMA or nonprocessor
reference (NPR) tests. Experience shows
that tests need to be rerun whenever a new peripheral 
is added. 

The system currently runs the RT-11 version 5.04
operating system on a configuration comprising

■ PDP-11/34 CPU with real-time clock and bootstraps

■ 256 kilobits of MOS memory

■ RX01 and RX02 floppy disks

■ Dual RL02 disks 

■ TU56 dual DECtape storage system 

■ TU58 DECtape II storage system 

■ Serial line units for console and serial printer 

■ CM11 mark sense and CR11 punched card reader

■ TU60 cassette

■ PC11 paper tape reader and punch

Although the following peripherals are available, 
they await installation time and effort: 

■ LPS-40 analog-to-digital (A/D) converter

■ TU10 magnetic tape

■ TSV03 magnetic tape 

■ Cache and commercial instruction set 

■ Battery backup kit

The eventual goal is to keep "the last great 
(UNIBUS) PDP-11" running with almost every 
UNIBUS peripheral ever made. Time will tell.

Simulating Old Computers 

A simulator is a computer program operating on one 
computer system (known as the host system) that
mimics the behavior of another computer system
(known as the target system). The simulator's data is 
the state of the target computer system — registers, 
memory, timed events, and so on. The simulator oper- 
ates on presented state and transforms it, usually by 
sequential evaluation, in the same manner as would 
the target computer system. 

Simulators typically consist of an execution engine, 
which performs the state transformations; a simple
timed-event mechanism, which supports deferred and 
asynchronous events such as I/O completions; and a 
control panel, which provides user access to simulated 
state. The execution engine is responsible for decoding 
instructions in simulated memory and performing the
specified alterations of simulated machine state. The
execution engine keeps track of simulated time in arbitrary
units, which may be precise representations of the



execution time of the target system, or simple representations
of advancing time, such as the number of
instructions executed. The event mechanism provides a 
way to schedule events, such as I/O completion, for 
later evaluation. It can also implement other time- 
based mechanisms such as keyboard polling. Finally, 
the control panel provides access to simulated state as
well as basic control commands such as start and stop. 
It may also provide more elaborate facilities to support 
performance instrumentation or debugging. 
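The interplay between the execution engine and the timed-event mechanism can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SIM's implementation (SIM is written in C): the three-instruction toy machine, the names EventQueue and run, and the choice of one time unit per instruction are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class EventQueue:
    """Timed-event mechanism; time is counted in instructions executed."""
    now: int = 0
    _seq: int = 0  # tie-breaker so simultaneous events fire in FIFO order
    _heap: List[Tuple[int, int, Callable[[], None]]] = field(default_factory=list)

    def schedule(self, delay: int, action: Callable[[], None]) -> None:
        """Defer an action until `delay` time units from now."""
        heapq.heappush(self._heap, (self.now + delay, self._seq, action))
        self._seq += 1

    def advance(self, ticks: int = 1) -> None:
        """Advance simulated time and run every event that has come due."""
        self.now += ticks
        while self._heap and self._heap[0][0] <= self.now:
            _, _, action = heapq.heappop(self._heap)
            action()


def run(program, events):
    """Execution engine for a hypothetical toy machine (ADD, START_IO, HALT)."""
    state = {"pc": 0, "acc": 0, "io_done": False}

    def complete_io():
        state["io_done"] = True  # stands in for an I/O completion

    while True:
        op, arg = program[state["pc"]]  # fetch and decode
        state["pc"] += 1
        if op == "ADD":
            state["acc"] += arg
        elif op == "START_IO":
            events.schedule(arg, complete_io)  # completes `arg` instructions later
        elif op == "HALT":
            return state  # control returns to the control panel
        events.advance()  # one instruction equals one unit of simulated time


events = EventQueue()
program = [("START_IO", 3), ("ADD", 1), ("ADD", 2), ("ADD", 3), ("HALT", 0)]
final = run(program, events)
```

Because the only time base is the instruction count, the completion event fires at the same point on every run, which is the determinism that makes simulators useful for replicable debugging.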

Historically, simulators have been used for many 
purposes, including the following: 

■ Design of new systems. The simulator mimics the
behavior of a future chip or computer system and is 
used to understand and debug the behavior of the 
proposed design. For example, prior to fabrication, 
all modern microprocessors are extensively simu- 
lated, first as abstract performance models and then 
at increasing levels of detail.

■ Debugging for embedded systems. If the simula- 
tor contains facilities for program debugging, it 
becomes a useful tool for debugging programs that 
run in highly constrained environments such as 
embedded systems. Simulators can capture more 
state and provide a wider range of facilities than in 
situ debuggers. For example, simulators can imple- 
ment program counter (PC) change queues, data 
access breakpoints, or precise traps on errors. 

■ Replicable event tracing. Most simulators are fully 
deterministic. Asynchronous events are scheduled 
based on simple, nonrandom algorithms, such as 
fixed time-out or calculated seek time. As a result, 
simulators allow for straightforward replication or 
playback of complicated sequences, removing the 
randomness factor that often plagues the debug- 
ging of asynchronous software on real systems. 

■ Preservation of past software. Simulators can pro- 
vide migration assistance in the transition from older 
to newer architectures. Many transitional computer
systems have provided simulators for older archi- 
tectures, typically at the microcode level, to assist 
customers and developers in preserving their invest- 
ments in the previous architecture. Examples 
include the early IBM System/360 series, which had
models that simulated the 1401, 1410, 7070, and 
7090 families, and the early Digital VAX systems, 
which included a PDP-11 compatibility mode.

Simulation Levels 

Simulators can be written at various levels of detail and 
thus various levels of fidelity to the target system. 
Three common levels of simulation are register trans- 
fer level (RTL), instruction, and software specific. 

An RTL simulator attempts to mimic the major 
hardware blocks of the target system and to implement
its actual logic equations. The goal is absolute



Digital Technical Journal, Vol. 8 No. 3 1996 29



fidelity, the test of which is that no piece of software
running on the simulator should behave differently 
than it would on the target hardware. In practice, such 
perfect mimicry is difficult to achieve, as it requires a
painstaking re-creation of timing detail (for example,
the actual acceleration curve of a DECtape storage
system) and access to implementation documentation 
that has often vanished. Nonetheless, some simulators 
have achieved results very close to this goal: MIMIC, 
a DECsystem-10 simulator written at Applied Data
Research, was able to run CPU- and device-specific 
diagnostics. (As testimony to the vulnerability of 
computing's past, all machine-readable copies of the 
MIMIC sources appear to have been lost.)

An instruction simulator steps back from the RTL
level and tries to simulate at the functional or the 
behavioral level. System elements are treated as func- 
tions that transform state according to the abstract 
definitions of the system architecture, rather than 
as logic blocks that transform state based on imple- 
mentation equations. Instruction simulators sacrifice 
absolute fidelity to the idiosyncrasies of a particular 
implementation and focus on the intentions of the 
architecture specification. As a result, instruction sim- 
ulators can usually run systems software and applica- 
tions but can rarely fool diagnostics. 
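To make the functional view concrete, the sketch below treats the PDP-8 memory-reference instructions as functions that transform architectural state (PC, AC, link, memory), with no attempt to model the logic that implements them. It is an illustrative Python fragment, not code from SIM: the step function and the dictionary layout are assumptions, and auto-index registers, extended memory, IOT, and operate instructions are deliberately left out.

```python
def step(mem, st):
    """Execute one PDP-8 memory-reference instruction, updating mem and st."""
    pc = st["pc"]
    ir = mem[pc]                       # fetch
    st["pc"] = (pc + 1) & 0o7777       # 12-bit program counter
    op = ir >> 9                       # top three bits select the operation
    if op >= 6:
        raise NotImplementedError("IOT and operate instructions are omitted")
    ea = ir & 0o177                    # 7-bit offset within a page
    if ir & 0o200:                     # page bit: current page, else page zero
        ea |= pc & 0o7600
    if ir & 0o400:                     # indirect bit (auto-index omitted)
        ea = mem[ea]
    if op == 0:                        # AND
        st["ac"] &= mem[ea]
    elif op == 1:                      # TAD: 13-bit add through the link
        total = ((st["link"] << 12) | st["ac"]) + mem[ea]
        st["ac"], st["link"] = total & 0o7777, (total >> 12) & 1
    elif op == 2:                      # ISZ: increment and skip if zero
        mem[ea] = (mem[ea] + 1) & 0o7777
        if mem[ea] == 0:
            st["pc"] = (st["pc"] + 1) & 0o7777
    elif op == 3:                      # DCA: deposit and clear AC
        mem[ea], st["ac"] = st["ac"], 0
    elif op == 4:                      # JMS: store return address, then jump
        mem[ea] = st["pc"]
        st["pc"] = (ea + 1) & 0o7777
    else:                              # JMP
        st["pc"] = ea


mem = [0] * 4096
mem[0o20] = 0o7777
mem[0o200] = 0o1020   # TAD 0020
mem[0o201] = 0o3021   # DCA 0021
st = {"pc": 0o200, "ac": 2, "link": 0}
step(mem, st)   # 2 + 7777 overflows 12 bits: AC becomes 1 and the link is set
step(mem, st)   # stores AC at 0021 and clears AC
```

A simulator at this level gets the architectural results right but says nothing about cycle timing, which is why diagnostics that probe implementation details can tell it apart from real hardware.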

Finally, a software-specific simulation further 
abstracts the functions of the target system to only those 
needed by a particular piece of target system software. 
For example, the OS/8 operating system on the PDP-8 
computer does not use program interrupts; a simulator
aimed at running only the OS/8 operating system 
would not need to implement interrupts or even 
queued events. A recent PDP-11 simulator designed to
run the 2.9 BSD UNIX operating system abstracted
parts of the PDP-11 system's interrupt model and could
not run other PDP-11 operating systems.

Simulating Minicomputers: A Case Study 

SIM is a portable instruction-level minicomputer simulator
implemented in C. Its objectives are to facilitate
the study and use of historic computer architectures by
making simulated implementations and historic software
available to anyone who has a 32-bit computer. It
supports the following target architectures

■ PDP-8

■ PDP-11 

■ Nova 

■ 18-bit PDP series (PDP-4, PDP-7, PDP-9, PDP-15)

and has been successfully ported to the VAX VMS, the
Alpha OpenVMS, the Digital UNIX, and the Linux
architectures. Ports to the Windows NT and the
Windows 95 architectures and to an IBM 1401 simulator
are under way.



General Design Considerations The design of an
instruction-level simulator is not technically complicated;
indeed, simulating a PDP-8 system is a common
problem in undergraduate computer science courses.
SIM follows the processor-memory switch (PMS)
structure proposed by Bell and Newell and implemented
in MIMIC and countless other simulators
since. The simulated system is a collection of
devices, one of which has special properties (the
CPU). Each device has state (registers) and one or
more units. Each unit has state and fixed- or variable-sized
storage. In the CPU device, the storage is main
memory. In an I/O device, the storage is the device
media. The CPU is distinguished from other devices
by having the master routine for instruction execution.
This routine is responsible for the sequential evaluation
of instructions and for the state transformations
that represent simulated execution. The CPU also provides
a few systemwide routines, such as symbolic disassembly
and input and a binary loader.
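The device and unit structure described above maps naturally onto two small record types. The following Python sketch is a hypothetical rendering of that structure, not SIM's actual C data structures: the class names, field names, register names, and storage sizes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Unit:
    """One unit of a device: its own state plus its storage."""
    name: str
    storage: List[int]            # main memory for the CPU, media for I/O
    attached_file: str = ""       # host file backing the media, if any


@dataclass
class Device:
    """A device has registers (state) and one or more units."""
    name: str
    registers: Dict[str, int] = field(default_factory=dict)
    units: List[Unit] = field(default_factory=list)

    def reset(self) -> None:
        """Return every register to zero, as a panel reset command might."""
        for reg in self.registers:
            self.registers[reg] = 0


# The CPU is simply the device whose unit storage is main memory.
cpu = Device("CPU", {"PC": 0o200, "AC": 0o1234}, [Unit("CPU0", [0] * 4096)])
disk = Device("RK", {"STATUS": 0}, [Unit("RK0", [0] * 1024, "os8.dsk")])
system = {dev.name: dev for dev in (cpu, disk)}


def examine(system, device, register):
    """Control-panel style access to simulated state."""
    return system[device].registers[register]
```

Keeping every device in the same shape is what lets a single control panel examine, deposit, attach, and reset without knowing anything about the device it is talking to.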

The devices interface to a control panel that provides
access to simulated state and control over execution.
The available commands in SIM are listed in
Table 5.

The control panel also includes routines that are
needed by most simulators, such as event queue maintenance
and character-by-character terminal I/O.
Different simulators need not use the same time base,
but all the SIM-based implementations to date use the
number of instructions executed as the time base.

Note that the control panel provides for starting simulation,
but termination is determined entirely by the
simulated CPU. By convention, the CPU returns control
to the control panel under the following conditions:

1. If a HALT instruction is executed

2. If a fatal exception is detected

3. If a fatal I/O error is detected

4. If a special character is typed at the controlling
terminal

Likewise, the control panel does not implement any
debugging facilities beyond state examination and 
modification and instruction stepping. To facilitate 
debugging with operating systems, CPUs provide 
a simple instruction breakpoint capability and a one- 
level PC trace facility. 
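The convention that the CPU alone decides when to stop can be pictured as a run loop that returns a reason code to the control panel. The Python sketch below is an assumption-laden illustration rather than SIM's code: the Stop enumeration, the toy NOP/JMP/HALT instruction set, and the interpretation of the one-level PC trace as "address of the previously executed instruction" are all hypothetical, and the I/O error and operator interrupt cases are listed but not simulated.

```python
from enum import Enum


class Stop(Enum):
    HALT = "HALT instruction executed"
    FATAL_EXCEPTION = "fatal exception detected"
    IO_ERROR = "fatal I/O error detected"         # listed, not simulated here
    OPERATOR = "special character typed"          # listed, not simulated here
    BREAKPOINT = "instruction breakpoint reached"


def run_cpu(program, pc=0, breakpoint=None):
    """Run until a stop condition; return (reason, pc, previous pc)."""
    last_pc = None  # one-level PC trace
    while True:
        if pc == breakpoint:
            return Stop.BREAKPOINT, pc, last_pc
        if not 0 <= pc < len(program):
            return Stop.FATAL_EXCEPTION, pc, last_pc  # fetch outside memory
        op, arg = program[pc]
        last_pc = pc
        if op == "HALT":
            return Stop.HALT, pc, last_pc
        pc = arg if op == "JMP" else pc + 1


program = [("NOP", 0), ("JMP", 3), ("HALT", 0), ("NOP", 0), ("JMP", 7)]
wild_jump = run_cpu(program)
at_break = run_cpu(program, breakpoint=3)
halted = run_cpu([("HALT", 0)])
```

In the first run the trace shows that the wild jump to address 7 came from address 4, which is the kind of question a one-level PC trace exists to answer. Note that this sketch stops immediately if started on the breakpoint address; a fuller version would skip the check for the first instruction.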

Implementation The implementation of a particular 
simulator begins with collecting reference manuals, 
maintenance manuals, design documents, folklore, 
and prior simulator implementations for the target 
system. This is nontrivial. In the early days of computing,
companies did not systematically collect and
archive design documentation. In addition, collected
material is subject to information decay, as noted

Table 5
Commands Available in SIM

Command                   Definition

attach <unit> <file>      Associate file with unit's media.
detach <unit> | ALL       Disassociate unit's (all units') media from any file.
reset <device> | ALL      Reset device (all devices).
load <file>               Load binary program from file.
boot <unit>               Reset all devices and bootstrap from unit.
run {<new PC>}            Reset all devices and resume execution at the current PC {or new PC}.
go {<new PC>}             Resume execution at the current PC {or new PC}.
cont                      Resume execution at the current PC.
step {<number>}           Execute one instruction {or number instructions}.
examine <list>            Display contents of list of memory locations or registers.
iexamine <list>           Display contents of list of memory locations or registers and allow interactive modification.
deposit <list> <value>    Store value in list of memory locations or registers.
ideposit <list>           Interactively modify list of memory locations or registers.
save <file>               Save simulator state in file.
restore <file>            Restore simulator state from file.
show queue                Display the simulator's event queue.
show configuration        Display the simulator's configuration.
show time                 Display the simulated time counter.
show <device>             Show device's configuration options.
set <device> <option>     Set a device configuration option.
help                      Display a terse help message.
exit | quit | bye         Leave the simulator.

earlier. Lastly, the material is likely to be contradictory,
embodying differing revisions or versions of the archi- 
tecture, as well as errors that have crept in during the 
documentation process. 

For Digital's 12-bit and 16-bit minicomputers, the
typical hierarchy of documentation was the following:

■ Processor Handbook. Providing an all-inclusive 
summary of the instruction set architecture, periph- 
erals, bus interface, and software, these paperback- 
size books are the most common form of system 
documentation but also the least accurate. 

■ Subsystem Reference Manual. As the programmer's 
reference manual for a particular subsystem, such as 
the CPU or the disk drive, these manuals describe 
the registers and functions accurately but omit 
maintenance-level features and other fine points.

■ Subsystem Maintenance Manual. As the mainte- 
nance engineer's manual for a particular subsystem, 
these manuals describe the registers and functions 
at the hardware implementation level, often includ- 
ing substantial abstracts from the print set. Because 
of the level of detail, the maintenance manuals have
proven to be the most useful references for simula- 
tor implementation. 



■ Design documents. For systems that do not have
very large-scale integration (VLSI), the only extant
design documents are the logic prints and the binary
microcode ROM listings. The prints are essential for
RTL simulation: they provide the only documentation
of implementation quirks. For VLSI systems,
there are chip-level design specifications as well as 
human-readable microprogram listings. 

■ Folklore. During the useful lifetime of a system, its 
users exchange information and create an informal 
record, both written and verbal, of shared expe- 
riences (folklore) regarding the fine points of 
operations, hardware/software interfaces, system
"personality," and other factors. Folklore is subject
to rapid information decay, particularly once the
target system becomes obsolete. 

■ Prior implementations. Prior simulator implementa- 
tions can provide useful information, but it must be 
used cautiously. Unless the prior implementation is 
an RTL model, it embodies simplifications and 
abstractions that are not explicitly documented. The
MIMIC sources (which are fragmentary and available
only on paper) proved trustworthy, but others
did not: for example, the 1970s PDP-11 simulator
in the DECUS archives is highly misleading about 
interrupts, condition codes, and other details. 

An important consideration is that much of the
documentation, all the folklore, and most working 
systems are in the hands of individual collectors. 
The Internet plays a vital role in locating material held 
by enthusiasts, through news conferences such as 
alt.folklore.computers, alt.sys.pdp8, alt.sys.pdp11,
and comp.emulators.misc, and more recently, through
World Wide Web sites devoted to historic systems.
The sources for each simulator in SIM are listed in 
Table 6. 

The last step in implementation is collecting soft- 
ware to run on the simulator. Software collection 
immediately raises the problem of media translation. 
Software for historic systems resides on paper tapes, 
DECtape storage systems, 200/556/800 bits-per- 
inch magnetic tapes, disk cartridges, 8-inch floppy 
disks, and so on. Few if any modern systems have these 
peripherals; and few if any historic systems have mod- 
ern network interconnects. Thus, media translation 
usually entails linking a working version of the target 
system to a modern system by means of a serial line. 
KERMIT or some other simple protocol allows for a
byte-by-byte network copy from the original media to 
a file on a modern system. 

Once the software has been located and moved 
to a file, the next issue is sources. Without sources, 
diagnostics and other test programs are useless; 
detected errors cannot be traced back to causes with- 
out manual decode of the binary program. The 
absence of sources was a principal reason for including 
symbolic disassembly and input in SIM. 
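As an illustration of why symbolic disassembly helps when sources are missing, the following Python sketch turns PDP-8 words into mnemonics. It is a minimal sketch, not SIM's disassembler: the function name, the output format, and the small table of named instructions are assumptions, and IOT and operate words outside that table are simply printed in octal.

```python
MEMREF = ["AND", "TAD", "ISZ", "DCA", "JMS", "JMP"]  # opcodes 0 through 5

# A few well-known fixed encodings; a real table would be much larger.
NAMED = {
    0o6031: "KSF", 0o6032: "KCC", 0o6036: "KRB",
    0o6041: "TSF", 0o6046: "TLS",
    0o7000: "NOP", 0o7200: "CLA", 0o7402: "HLT",
}


def disassemble(word, pc):
    """Return a symbolic form of one 12-bit PDP-8 word located at pc."""
    word &= 0o7777
    if word in NAMED:
        return NAMED[word]
    opcode = word >> 9
    if opcode < 6:
        offset = word & 0o177
        # Page bit set: operand is on the instruction's own page.
        addr = (pc & 0o7600) | offset if word & 0o200 else offset
        indirect = " I" if word & 0o400 else ""
        return f"{MEMREF[opcode]}{indirect} {addr:04o}"
    return f"{word:04o}"  # undecoded IOT or operate microinstruction


listing = [disassemble(w, 0o200) for w in (0o1377, 0o5410, 0o7402, 0o6666)]
```

With a routine like this, a message such as "Simulation stopped, PC: 01207 (KSF)" can name the instruction at the stopping point instead of leaving the user to decode octal by hand.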



The final issue in software is licensing. Even though
the target systems are obsolete and often no longer
manufactured, the operating system software may be
protected by copyrights and licenses. Most PDP-8
software is in the public domain; however, the PDP-11
and Nova operating systems are still licensed, as are
all versions of UNIX. Corporate licensing policies
rarely accommodate hobbyists; this limits operating
system distribution to legitimate (that is, business)
users. Table 7 lists the software found for each simulator
in SIM.

Debug The debug path for a simulator depends 
on the available software. Ideally, the simulator would 
be debugged with the same software tests used 
to debug the target hardware, but this software is 
rarely archived. Diagnostics can provide low-level 
checking, but diagnostics typically check for broken 
parts in a correct implementation, rather than an 
incorrect implementation. Even when diagnostics 
do check architecture rather than implementation (as 
in the basic instruction diagnostics on the PDP-11 
system), the absence of sources limits their utility. 
Consequently, the simulators were debugged mostly 
with simple hand tests and then with the operating 
systems. 

Operating systems are both exacting and imprecise 
tests of implementation correctness. Unless an 
operating system takes a deliberately restrictive view 
of hardware (for example, OS/8 does not use the 
PDP-8 interrupt system, and RT-11 does not use any



Table 6
Sources for Simulators in SIM

Architecture   Documents                  Location

PDP-8          Minicomputer Handbook      Private collection
               Reference manuals          Digital archive
               Maintenance manuals        Digital Australia collection
               Print sets                 Digital Australia collection
               Prior implementations      Public archive
                                          Public archive
                                          MIMIC, private collection

PDP-11         Minicomputer Handbook      Private collection
               Reference manuals          Digital archive
               Maintenance manuals        Digital Australia collection
               Chip specifications        Private collection
               Microcode listings         Private collection
               Prior implementations      Public archive
                                          MIMIC, private collection

Nova           System Reference Manual    Private collection
               Reference manuals          Data General archive
               Maintenance manuals        Private collection
               Prior implementations      MIMIC, private collection

18-bit PDP     Reference manuals          Digital archive
               Maintenance manuals        Digital archive
               Print sets                 Digital archive

Table 7
Software for Simulators in SIM

Architecture   Software                          Location

PDP-8          Basic instruction tests 1 and 2   Digital Australia collection
               Memory management test            Digital Australia collection
               FOCAL69                           Digital Australia collection
               OS/8 system disk                  Public archive

PDP-11         RT-11                             Transcribed from real system
               RSX-11M                           Transcribed from real system
               RSTS/E                            Transcribed from real system
               UNIX V5, V6, V7, 2.9 BSD          PDP UNIX Preservation Society (PUPS) archive
               2.11 BSD                          Private collection

Nova           RDOS                              Private collection

18-bit PDP     No software to date



optional PDP-11 instructions), the operating system
will be sensitive to every error in implementation.
For example, Digital's second-generation PDP-11
systems (the PDP-11/05, 11/40, and 11/45)
were debugged with DOS-11 and RSTS after diagnostics
failed to detect certain subtle implementation
errors. Unfortunately, in an operating system, the
distance in time and space between the error and the
symptom may be enormous, and the traceable path
may be lengthy and complicated. Artifacts in the
software can also complicate debug: the OS/8 disk
image on the Internet contains a copy of BASIC that
is broken.

Results SIM implements four minicomputer architec- 
tures: PDP-8, PDP-11, Nova, and 18-bit PDP. Each 
simulator includes a particular CPU; basic peripherals 
such as terminal, paper tape, clock, and printer; and 
a selection of mass storage peripherals (see Table 8). 

The PDP-8 simulator has run the FOCAL69 and 
the OS/8 operating systems. The PDP-11 simulator 
has run the following operating systems: RT-11 V4 
and V5; RSX-11M V4; RSTS/E V8; UNIX V5,
V6, and V7; and BSD V2.9 and V2.11. The Nova
simulator has run the RDOS V7.5 operating system. 
No system software for the 18-bit PDP systems 
has been found. The simulators were exercised on an 
AlphaStation 3000/600 workstation (approximately 
120 SPECint92); the performance is given in Table 9.

Figures 2, 3, and 4 show screen shots from the various 
simulators running their principal operating systems. 

In Defense of Computing's History 

As professional engineers who have been lucky
enough to witness the computer revolution, the 
authors believe that the industry has a duty to keep 
early machines alive. There are practical reasons, such 



as preservation of software and data; beyond that, 
there is an obligation to future generations. In 100 
years, the systems from computing's early history will 
appear to be absolute dinosaurs of the past. Yet their 
educational and sociological value will be consider- 
able. A computer is a machine with a soul, and it must 
be kept alive with its operating environment to show 
its abilities and the contemporary state of the art. 

Table 8
Architectures Implemented by SIM

                PDP-8                    PDP-11          Nova

CPU             PDP-8/E                  J-11, Q-bus     Nova 820
Options         KE8E EAE,                Integral FP11   Multiply/divide
                KM8E memory extension
Memory          4-32K words              16KB-4MB        4-32K words
Terminal        KL8E                     DL11            KSR-33, Dasher
Paper tape      PC8E                     PC11            Yes
Clock           DK8E                     KW11L           Yes
Printer         LE8E                     LP11            Yes
Storage                                  RX11/RX01       4019
                RK8E/RK05                RK11/RK05       4046/4047, 4048,
                RF08/RS08                RLV11/RL01,2    4057, 4234
Magnetic tape   TM8E/TU10                TM11/TU10       6026

                PDP-4          PDP-7           PDP-9            PDP-15

CPU             PDP-4          PDP-7           PDP-9            [illegible], PDP-15/30
Options                        T177 EAE,       KE09A EAE,       KE15 EAE,
                               T148 memory     KX09A memory     KM15 memory
                               extension       protection,      protection,
                                               KP09A power      KP15 power
Memory          4-8K words     4-32K words     4-32K words      4-128K words
Terminal        KSR-28         KSR-33          KSR-33           KSR-35
Paper tape      Integral,      T444 reader,    PC09A            PC15
                T75 punch      T75 punch       reader-punch     reader-punch
Clock           Yes            Yes             Yes              Yes
Printer         T62            T647            T647E            LP15
Storage                        T24 drum        RF09/RS09        RF15/RS09,
                                                                RP15/RP02
Magnetic tape                                  TC59/TU10        TC59/TU10



the hardware. In addition, Bill provided a working
OS/8 system disk, and John copied several PDP-11
operating system disks off a working PDP-11/34.
Megan Gentry was an important source of PDP-11
folklore, debugged some of the subtlest problems, created
the Makefile, and provided the first and most
frequently used distribution site. Ben Thomas
provided the character-by-character I/O routines
for VMS. Chris Suddick helped debug the PDP-11
floating-point code. Warren Toomey and the enthusiasts
at PUPS (the PDP UNIX Preservation Society)
in Australia allowed me access to their archive of early
UNIX releases. Leendert Van Doorn debugged
the PDP-11 simulator with UNIX V6, and Franc
Grootjen with 2.11 BSD. Larry Stewart provided the
initial impetus to the project, and Ken Harrenstein
made an important contribution to preservation
by implementing a DECsystem-10 simulator. Last,
but not least, Max Burnet generously provided
documentation and software from the Digital
Australia collection, answered questions based on his
30 years of experience with Digital's systems, and
made connections with and introductions to the
worldwide community of historic machine hobbyists
and enthusiasts.

Table 9
Simulator Performance

             Simulated       Real
             Instructions    Instructions
Simulator    per Second      per Second      Ratio

PDP-8        1,800,000       400,000         4.5:1
PDP-11       440,000         500,000         .88:1
Nova         1,700,000       750,000         2.26:1
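The ratio column in Table 9 is simply the simulated instruction rate divided by the real machine's rate, so a value above 1 means the simulator outruns the original hardware. The short Python check below recomputes it from the table's figures; the dictionary layout and variable names are only illustrative.

```python
# (simulated instructions per second, real instructions per second), from Table 9
rates = {
    "PDP-8": (1_800_000, 400_000),
    "PDP-11": (440_000, 500_000),
    "Nova": (1_700_000, 750_000),
}

# Ratio of simulated speed to real speed for each simulator.
ratios = {name: simulated / real for name, (simulated, real) in rates.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}:1")
```

The Nova figure works out to about 2.267, so the table's 2.26:1 appears to be truncated rather than rounded. On this host only the PDP-11 simulator runs slower than the machine it models.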



ucoder> pdp8

PDP-8 simulator V2.2b
sim> att rk0 os8.dsk
sim> boot rk0

.DA 08-APR-96

.DIR

          08-Apr-96

COPYIT.SV   2 09-Mar-93   PASS2 .SV  20 11-Oct-92   FORT3 .LD   3 06-Jul-93
DIRECT.SV   7 11-Oct-92   PASS2O.SV   5 11-Oct-92   CLOSE .SV   2 10-Jul-93
CCLX  .SV  24 25-Feb-93   PASS3 .SV   8 11-Oct-92   FORT4 .FT   1 11-Jul-93
PIP   .SV  11 11-Oct-92   RALF  .SV  19 11-Oct-92   FORT4 .LD   2 04-Aug-93
FOTP  .SV   8 11-Oct-92   RESORC.SV  10 11-Oct-92   FORT6 .LD   2 09-Aug-93
ABSLDR.SV   5 11-Oct-92   RUNOFF.SV  24 11-Oct-92   FORT5 .FT   1 09-Aug-93
BASIC .SV  11 11-Oct-92   SABR  .SV  24 11-Oct-92   FORT5 .LD   2 09-Aug-93
BATCH .SV  10 11-Oct-92   SCROLL.SV  17 11-Oct-92   FORT6 .FT   1 09-Aug-93
BCOMP .SV  26 11-Oct-92   SET   .SV  20 11-Oct-92   METSC .SV  10 11-Aug-93
BITMAP.SV   5 11-Oct-92   SRCCOM.SV   5 11-Oct-92   METSC2.SV  10 11-Aug-93
BLOAD .SV  10 11-Oct-92   TECO  .SV  32 11-Oct-92   EMAT  .SV   9 11-Aug-93
BOOT  .SV   5 11-Oct-92   VERSN3.SV  10 11-Oct-92   EMDCT .SV  14 11-Aug-93
BRTS  .SV  24 11-Oct-92   BUILD .SV  33 11-Oct-92   EMTST .SV  10 11-Aug-93
CHEKMO.SV  15 11-Oct-92   BASIC .OV  16 11-Oct-92   SINST1.SV  14 11-Aug-93
COMPAF.SV   5 11-Oct-92   BUILD6.SV  33 11-Oct-92   ADDER .SV  13 11-Aug-93
CREF  .SV  13 11-Oct-92   BUILT .SV  33 12-Oct-92   FORT7 .FT   1 30-Aug-93
EDIT  .SV  10 11-Oct-92   HELP  .HE   1 18-Oct-92   CLEAR .LS   2 13-Jan-94
EDITS .SV   6 11-Oct-92   HELP  .HL  72 18-Oct-92   CLEAR .CF   2 13-Jan-94
EPIC  .SV  14 11-Oct-92   HELP  .OC   4 18-Oct-92   CLEAR .SV   2 13-Jan-94
F4    .SV  20 11-Oct-92   FORT7 .LD   2 07-Sep-93   CLEAR .PA   1 13-Jan-94
FRTS  .SV  26 11-Oct-92   JMPTST.SV   3 18-Oct-92   CLEAR .BN   2 13-Jan-94
FUTIL .SV  26 11-Oct-92   JMPJMS.SV   3 18-Oct-92   DEMO  .    28 21-Mar-95
HELP  .SV   5 11-Oct-92   RK8ENS.BN   1 30-Oct-92   DOS   .PA   4 25-Jan-94
LIBRA .SV  11 11-Oct-92   INST1 .SV  14 01-Dec-92   DOS   .BN   1 25-Jan-94
LIBSET.SV   5 11-Oct-92   INST2 .SV  11 01-Dec-92   DOS   .LS  10 25-Jan-94
LOAD  .SV  16 11-Oct-92   FORT  .FT   1 17-Jun-93   SHELL .PA   1 25-Jan-94
LOADER.SV  12 11-Oct-92   FORT  .LD   2 09-Jul-93   SHELL .BN   1 25-Jan-94
MATST .SV   9 11-Aug-93   FORT2 .LD   2 09-Jul-93   SHELL .LS   2 25-Jan-94
MDTST .SV  14 11-Aug-93   FORT2 .FT   1 22-Jun-93   BASIC .WS   1 10-Mar-94
OCOMP .SV   8 11-Oct-92   DOS   .SV   2 25-Jan-94   FOO   .PA   1 31-Mar-94
OPTF4 .SV  13 11-Oct-92   SHELL .SV   2 25-Jan-94   FOO   .BN   1 31-Mar-94
PAL8  .SV  19 11-Oct-92   FORT3 .FT   1 26-Jun-93

Simulation stopped, PC: 01207 (KSF)
sim>



Figure 2 

PDP-8 Simulator Running OS/8 

ucoder> nova

NOVA simulator V2.2b
sim> att dp0 rdos.dsk
sim> set tti dasher
sim> boot dp0

Filename?

NOVA RDOS Rev 7.50
Date (m/d/y) ? 4 8 96
Time (h:m:s) ? 16 26 0

R
list/e sys-.-
SYS5.LB      17216  D    05/24/77 13:18  05/31/85  [001017]  0
SYS.SV       56320  SD   12/14/95 16:21  12/14/95  [005057]  0
SYS.LB       20240  D    04/30/85 14:49  05/31/85  [000746]  0
SYS.OL       30720  C    12/14/95 16:21  12/14/95  [005272]  0
SYSGEN.SV    23040  SD   05/02/85 22:20  05/31/85  [001401]  0
R
disk
LEFT: 2158   USED: 2706   MAX CONTIGUOUS: 2054
R

Simulation stopped, PC: 41740 (LDA 1,4,3)
sim>



Figure 3 

Nova Simulator Running RDOS 



5. A. Ahi, G. Burroughs, A. Gore, S. LaMar, C.-Y. Lin,
and A. Wiemann, "Design Verification of the HP 9000
Series 700 PA-RISC Workstations," Hewlett-Packard
Journal, vol. 43, no. 4 (1992).

6. W. Anderson, "Logical Verification of the NVAX CPU 
Chip Design," Digital Technical Journal, vol. 4, 
no. 3 (1992): 38-46. 
ucoder> pdp11

PDP-11 simulator V2.2b
sim> att rk0 rtrk.dsk
sim> boot rk0

RT-11SJ (S) V05.04

.da 8-apr-96
.dir

08-Apr-96
NL    .SYS     2  18-Sep-89      RT11FB.SYS    94  18-Sep-89
RT11SJ.SYS    80  18-Sep-89      SPOOL .REL    11  14-Apr-87
PTESTX.MAC    23  27-Jan-94      GVI   .SAV     5  18-Apr-90
BINCOM.SAV    24  27-Sep-88      DUP   .SAV    49  27-Sep-88
DIR   .SAV    19  27-Sep-88      IND   .SAV    58  27-Sep-88
LIBR  .SAV    24  27-Sep-88      MACRO .SAV    61  27-Sep-88
LINK  .SAV    49  27-Sep-88      RESORC.SAV    25  27-Sep-88
FORMAT.SAV    24  27-Sep-88      ODT   .SAV     8  05-Oct-89
PBCOPY.SAV     2  16-Feb-89      SYSLIB.OBJ    55P 05-Oct-89
ODT   .OBJ     8  05-Oct-89      SYSMAC.SML    61  16-Mar-89
SIPP  .SAV    21  27-Sep-88      DATE  .SAV     3  02-Feb-89
IOP   .SAV    11  24-Apr-89      SWAP  .SYS    27  27-Sep-88
TT    .SYS     2  18-Sep-89      DL    .SYS     4  18-Sep-89
DM    .SYS     5  18-Sep-89      DP    .SYS     3  18-Sep-89
DX    .SYS     4  18-Sep-89      RK    .SYS     3  18-Sep-89
LS    .SYS     5  05-Oct-89      MT    .SYS     9  18-Sep-89
LP    .SYS     2  18-Sep-89      SP    .SYS     6  18-Sep-89
PIP   .SAV    30  27-Sep-88      HANDLE.SAV     7  16-Feb-89
LD    .SYS     8  26-Dec-90      MAC   .SAV    61  27-Sep-88
LC    .SYS     2  01-Jan-80      UCL   .SAV    13  22-Dec-89
UCL   .CCL     4  07-Oct-90      STARTS.COM     1  19-Jan-94
MTPIP .SAV    28  27-Feb-87      MTROL .SAV    17  27-Feb-87
MLIB  .SYS   300  20-Dec-90      HELP  .SAV   132  20-Dec-90
XPC   .SAV    16  25-Jun-91      DESS  .SAV    18  09-Mar-88
PTESTX.OBJ     8
 49 Files, 1432 Blocks
 3330 Free blocks

.sho dev

Device    Status           CSR      Vector(s)
  NL      Installed        000000   000
  TT      Installed        000000   000
  DL      Installed        174400   160
  DM      Not installed    177440   210
  DP      Not installed    176710   254
  DX      Installed        177170   264
  RK      Resident         177400   220
  LS      Not installed    176500   470 474 300 304
  MT      Installed        172520   224
  LP      Installed        177514   200
  SP      Installed        000000   110
  LD      Installed        000000   000
  LC      Installed        177514   200

Simulation stopped, PC: 146506 (ASR R5)
sim>

Figure 4

PDP-11 Simulator Running RT-11
Biographies



Maxwell M. Burnet

Max Burnet has been with Digital in Australia for 29 years.
During that time, he has sold, serviced, or marketed all the
machines in the collection. He managed the Digital
Australia subsidiary for seven years. He was a salesman
in Boston during 1971 and managed to replace an IBM
1620 at Tufts University with a PDP-10. He is currently
the oldest surviving "techie" in the Sydney office and
makes many corporate presentations in Australia. He
manages the Australian DECUS Society, the Subsidiary's
local content and export obligations with the Australian
Government, and the local Product Assurance Group.
He has collected a museum of early Digital machines and
is known around Sydney as "Museum Max." He received
a B.Sc. (honours) from Melbourne University.




Robert M. Supnik 

Bob Supnik has been with Digital in the United States
for 19 years. He joined the Mass Storage Group and then
moved into Semiconductor Engineering, where he successively
managed the last PDP-11 implementation (the J-11),
Advanced Development, the first single-chip VAX
implementation (the MicroVAX chip), and the VAX
Microprocessor Group. He also wrote or contributed to the
microcode of every single-chip VAX microprocessor. In
1988, he started the Alpha program, which he managed
through launch of the first products in 1992. He then
became technical director, first of Engineering and then
of the Computer Systems Division. In 1996, he became
vice president of Research and Advanced Development.
He has B.A. degrees in mathematics and in history from
MIT, and an M.A. in history from Brandeis University.
William N. Celmaster



Modern Fortran 
Revived as the 
Language of Scientific 
Parallel Computing 



New features of Fortran are changing the way 
in which scientists are writing and maintaining 
large analytic codes. Further, a number of these 
new features make it easier for compilers to 
generate highly optimized architecture-specific 
codes. Among the most exciting kinds of 
architecture-specific optimizations are those 
having to do with parallelism. This paper 
describes Fortran 90 and the standardized 
language extensions for both shared-memory 
and distributed-memory parallelism. In par- 
ticular, three case studies are examined, show- 
ing how the distributed-memory extensions 
(High Performance Fortran) are used both for 
data parallel algorithms and for single-program- 
multiple-data algorithms. 



A Brief History of Fortran 

The Fortran (FORmula TRANslating) computer language was the
result of a project begun by John Backus at IBM in 1954. The
goal of this project was to provide a way for programmers to
express mathematical formulas through a formalism that
computers could translate into machine instructions. Initially
there was a great deal of skepticism about the efficacy of such
a scheme. "How," the scientists asked, "would anyone be able to
tolerate the inefficiencies that would result from compiled
code?" But, as it turned out, the first compilers were
surprisingly good, and programmers were able, for the first
time, to express mathematics in a high-level computer language.

Fortran has evolved continually over the years in response to
the needs of users, particularly in the areas of mathematical
expressivity, program maintainability, hardware control (such
as I/O), and, of course, code optimizations. In the meantime,
other languages such as C and C++ have been designed to better
meet the nonmathematical aspects of software design, such as
graphical interfaces and complex logical layouts. These
languages have caught on and have gradually begun to erode the
scientific/engineering Fortran code base.

By the 1980s, pronouncements of the "death of Fortran" prompted
language designers to propose extensions to Fortran that would
incorporate the best features of other high-level languages
and, in addition, provide new levels of mathematical
expressivity popular on supercomputers such as the CYBER 205
and the CRAY systems. This language became standardized as
Fortran 90 (ISO/IEC 1539:1991; ANSI X3.198-1992). At the
present time, Fortran 95, which includes many of the
parallelization features of High Performance Fortran discussed
later in this paper, is in the final stages of standardization.
It is not yet clear whether the modernization of Fortran can,
of itself, stem the C tide. However, I will demonstrate in this
paper that modern Fortran is a viable mainstream language for
parallelism. It is true that parallelism is not yet part of the
scientific programming mainstream. However, it seems likely
that, with the scientists' never-ending thirst for affordable
performance, parallelism will become much more common,
especially now that appropriate standards have evolved. Just as
early Fortran enabled average scientists and engineers to
program the computers of the 1960s, modern Fortran may enable
average scientists and engineers to program parallel computers
of the next decade.

An Introduction to Fortran 90 

Fortran 90 introduces some important capabilities in
mathematical expressivity through a wealth of natural
constructs for manipulating arrays. In addition, Fortran 90
incorporates modern control constructs and up-to-date features
for data abstraction and data hiding. Some of these constructs,
for example, DO WHILE, although not part of FORTRAN 77, are
already part of the de facto Fortran standard as provided, for
example, with DEC Fortran.

Among the key new features of Fortran 90 are the
following:

■ Inclusion of all of FORTRAN 77, so users can 
compile their FORTRAN 77 codes without 
modification 

■ Permissibility of free-form source code, so pro- 
grammers can use long (i.e., meaningful) variable 
names and are not restricted to begin statements 
in column 7 

■ Modern control structures like CASE and DO
WHILE, so programmers can take advantage of
structured programming constructs

■ Extended control of numeric precision, for archi- 
tecture independence 

■ Array processing extensions, for more easily express- 
ing array operations and also for expressing inde- 
pendence of element operations 

■ Pointers, for more flexible control of data placement

■ Data structures, for data abstraction 

■ User-defined types and operators, for data 
abstraction 

■ Procedures and modules, to help programmers 
write reusable code 

■ Stream character-oriented input/output features 

■ New intrinsic functions 

With these new features, a modern Fortran programmer can not
only successfully compile and execute previous
standards-compliant Fortran codes but also design better codes
with

■ Dramatically simplified ways of doing dynamic 
memory management 

■ Dynamic memory allocation and deallocation for 
memory management 

■ Better modularity and therefore reusability 



■ Better readability 

■ Easier program maintenance 

Additionally, of course, programmers have the
assurance of complete portability between platforms
and architectures.
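
Several of the features listed above (modules, user-defined types and operators, and the CASE construct) can be seen together in one small program. The following is only a minimal sketch: the names VEC2_MOD, VEC2, and VEC2_ADD are hypothetical and chosen for illustration; they do not come from any Digital product or from the standard.

```fortran
! Hypothetical module: a two-component vector type with its own "+".
MODULE VEC2_MOD
  IMPLICIT NONE
  TYPE VEC2                      ! user-defined (derived) type
    REAL :: X, Y
  END TYPE VEC2
  INTERFACE OPERATOR(+)          ! user-defined meaning for "+"
    MODULE PROCEDURE VEC2_ADD
  END INTERFACE
CONTAINS
  FUNCTION VEC2_ADD(A, B) RESULT(C)
    TYPE(VEC2), INTENT(IN) :: A, B
    TYPE(VEC2) :: C
    C%X = A%X + B%X
    C%Y = A%Y + B%Y
  END FUNCTION VEC2_ADD
END MODULE VEC2_MOD

PROGRAM DEMO_TYPES
  USE VEC2_MOD                   ! reuse the module's type and operator
  IMPLICIT NONE
  TYPE(VEC2) :: P, Q, R
  INTEGER :: CHOICE
  P = VEC2(1.0, 2.0)             ! structure constructor
  Q = VEC2(3.0, 4.0)
  R = P + Q                      ! resolves to VEC2_ADD
  CHOICE = 2
  SELECT CASE (CHOICE)           ! CASE construct
  CASE (1)
    PRINT *, 'X component:', R%X
  CASE (2)
    PRINT *, 'Y component:', R%Y
  CASE DEFAULT
    PRINT *, 'no component selected'
  END SELECT
END PROGRAM DEMO_TYPES
```

Because the type and its operator live in a module, any other program unit can pick them up with a single USE statement, which is the kind of reusability the list above refers to.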

The following code fragment illustrates the simplicity of
dynamic memory allocation with Fortran 90. It also includes
some of the new syntax for declaring variables, some examples
of array manipulations, and an example of how to use the new
intrinsic matrix multiplication function. In addition, the
exclamation mark, which is used to begin comment statements, is
a new Fortran 90 feature that was widely used in the past as an
extension to FORTRAN 77.



      REAL, DIMENSION(:,:,:), &      ! NEW DECLARATION SYNTAX
            ALLOCATABLE :: GRID      ! DYNAMIC STORAGE
      REAL*8 A(4,4),B(4,4),C(4,4)    ! OLD DECLARATION SYNTAX
      READ *, N                      ! READ IN THE DIMENSION
      ALLOCATE(GRID(N+2,N+2,2))      ! ALLOCATE THE STORAGE
      GRID(:,:,1) = 1.0              ! ASSIGN PART OF ARRAY
      GRID(:,:,2) = 2.0              ! ASSIGN REST OF ARRAY
      A = GRID(1:4,1:4,1)            ! ASSIGNMENT
      B = GRID(2:5,1:4,2)            ! ASSIGNMENT
      C = MATMUL(A,B)                ! MATRIX MULTIPLICATION



Some of the new features of Fortran 90 were introduced not only
for simplified programming but also to permit better
hardware-specific optimizations. For example, in Fortran 90,
one can write the array assignment

      A = B + C

which in FORTRAN 77 would be written as

      DO 100 J = 1,N
        DO 200 I = 1,M
          A(I,J) = B(I,J) + C(I,J)
200     END DO
100   END DO

The Fortran 90 array assignment not only is more elegant but
also permits the compiler to easily recognize that the
individual element assignments are independent of one another.
If the compiler were targeting a vector or parallel computer,
it could generate code that exploits the architecture by taking
advantage of this independence between iterations.

Of course, the particular DO loop shown above is simple enough
that many compilers would recognize the independence of
iterations and could therefore perform the
architecture-specific optimizations without the aid of Fortran
90's new array constructs. But in general, many of the new
features of Fortran 90 help compilers to perform
architecture-specific optimizations. More important, these
features help programmers express basic numerical algorithms in
ways inherently more amenable to optimizations that take
advantage of multiple arithmetic units.
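
The equivalence between the whole-array assignment and the nested DO loops can be checked directly. The sketch below is illustrative only: the array sizes, the test data, and the name A_LOOP are assumptions made for the example.

```fortran
PROGRAM ARRAY_ASSIGN
  IMPLICIT NONE
  INTEGER, PARAMETER :: M = 3, N = 4      ! assumed small sizes
  REAL :: A(M,N), A_LOOP(M,N), B(M,N), C(M,N)
  INTEGER :: I, J

  ! Fill B and C with simple, exactly representable test data.
  DO J = 1, N
    DO I = 1, M
      B(I,J) = REAL(I)
      C(I,J) = REAL(10*J)
    END DO
  END DO

  ! Fortran 90 form: one statement, no stated order of element updates.
  A = B + C

  ! FORTRAN 77 style: the same result, element by element.
  DO J = 1, N
    DO I = 1, M
      A_LOOP(I,J) = B(I,J) + C(I,J)
    END DO
  END DO

  IF (ALL(A == A_LOOP)) THEN
    PRINT *, 'array assignment matches the DO loops'
  ELSE
    PRINT *, 'mismatch between the two forms'
  END IF
END PROGRAM ARRAY_ASSIGN
```

The array form leaves the order of the element operations unspecified, which is what gives a compiler the freedom to vectorize or parallelize it.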
A Brief History of Parallel Fortran: PCF and HPF

During the past ten years, two significant efforts have been
undertaken to standardize parallel extensions to Fortran. The
first of these was under the auspices of the Parallel Computing
Forum (PCF) and targeted global-shared-memory architectures.
The PCF effort was directed to control parallelism, with little
attention to language features for managing data locality. The
1991 PCF standard established an approach to shared-memory
extensions of Fortran and also established an interim syntax.
These extensions were later somewhat modified and incorporated
in the standard extensions now known as ANSI X3H5.

At about the time the ANSI X3H5 standard was adopted, another
standardization committee began work on extending Fortran 90
for distributed-memory architectures, with the goal of
providing a language suitable for scalable computing. This
committee became known as the High Performance Fortran Forum
and produced in 1993 the High Performance Fortran (HPF)
language specification. The HPF programming-model target was
data parallelism, and many data placement directives are
provided for the programmer to optimize data locality. In
addition, HPF includes ways to specify a more general style of
single-program-multiple-data (SPMD) execution in which separate
processors can independently work on different parts of the
code. This SPMD specification is formalized in such a way as to
make the resulting code far more maintainable than previous
message-passing-library ways of specifying SPMD distributed
parallelism.

Can HPF and PCF extensions be used together in the same
Fortran 90 code? Sure. But the PCF specification has lots of
"user-beware" warnings about the correct usage of the PARALLEL
REGION construct, and the HPF specification has lots of
warnings about the correct usage of the EXTRINSIC(HPF_LOCAL)
construct. So as you can see, there are times when a programmer
had better be very knowledgeable if she or he wants to write a
mixed HPF/PCF code. Digital's products support both the PCF and
HPF extensions. The HPF extensions are supported as part of the
DEC Fortran 90 compiler, and the PCF extensions are supported
through Digital's KAP Fortran optimizer.

Shared Memory Fortran Parallelism 

The traditional discussions of parallel computing focus rather
heavily on what is known as control parallelism. Namely, the
application is analyzed in terms of the opportunities for
parallel execution of various threads of control. The canonical
example is a DO loop in which the individual iterations operate
on independent data. Each iteration could, in principle, be
executed simultaneously (provided of course that the hardware
allows simultaneous access to instructions and data).
Technology has evolved to the point at which compilers are
often able to detect these kinds of parallelization
opportunities and automatically decompose codes. Even when the
compiler is not able to make this analysis, the programmer
often is able to do so, perhaps after performing a few
algorithmic modifications. It is then relatively easy to
provide language constructs that the user can add to the
program as parallelization hints to the compiler.

This kind of analysis is all well and good, provided that data
can be accessed democratically and quickly by all processors.
With modern hardware clocked at about 300 megahertz, this
amounts to saying that memory latencies are lower than 100
nanoseconds, and memory bandwidths are greater than 100
megabytes per second. This characterizes today's single and
symmetric multiprocessing (SMP) computers such as Digital's
AlphaServer 8400 system, which comes with twelve 600-megaflop
processors on a backplane with a bandwidth of close to
2 gigabytes per second.

In summary, the beauty of shared-memory parallelism is that the
programmer does not need to worry too much about where the data
is and can concentrate instead on the easier problem of control
parallelism. In the simplest cases, the compiler can
automatically decompose the problem without requiring any code
modifications. For example, automatic decomposition for SMP
systems of a code called, for example, cfd.f, can be done
trivially with Digital's KAP optimizer by using the command
line

      kf90 -fkapargs='-conc' cfd.f -o cfd.exe

As an example of guided automatic decomposition, the following
shows how a KAP parallelization assertion can be included in
the code. (Actually, the code segment below is so simple that
the compiler can automatically detect the parallelism without
the help of the assertion.)

C*$* ASSERT DO (CONCURRENT)
      DO I = 4,N
        A(I) = B(I) + C(I)
      END DO

For explicit control of the parallelism, PCF directives can be
used. In the example that follows, the KAP preprocessor form of
the PCF directives is used to parallelize a loop.

C*KAP*PARALLEL REGION
C*KAP*&SHARED(A,B,C) LOCAL(I)
C*KAP*PARALLEL DO
      DO 10 I = 1,N
        A(I) = B(I) + C(I)
10    CONTINUE
C*KAP*END PARALLEL REGION
Cluster Fortran Parallelism

High Performance Fortran V1.1 is currently the only language
standard for distributed-memory parallel computing. The most
significant way in which HPF extends Fortran 90 is through a
rich family of data placement directives. There are also
library routines and some extensions for control parallelism.
HPF is the simplest way of parallelizing data-parallel
applications on clusters (also known as "farms") of
workstations and servers. Other methods of cluster parallelism,
such as message passing, require more bookkeeping and are
therefore less easy to express and less easy to maintain. In
addition, during the past year, HPF has become widely available
and is supported on the platforms of all major vendors.

HPF is often considered to be a data parallel language. That
is, it facilitates parallelization of array-based algorithms in
which the instruction stream can be described as a sequence of
array manipulations, each of which is inherently parallel. What
is less well known is that HPF also provides a powerful way of
expressing the more general SPMD parallelism mentioned earlier.
This kind of parallelism, often expressed with message-passing
libraries such as MPI, is one in which individual processors
can operate simultaneously on independent instruction streams
and generally exchange data either by explicitly sharing memory
or by exchanging messages. Three case studies follow which
illustrate the data parallel and the SPMD styles of
programming.

A One-dimensional Finite-difference Algorithm 

Consider a simple one-dimensional grid problem (the most
mind-bogglingly simple illustration of HPF in action) in which
each grid value is updated as a linear combination of its
(previous) nearest neighbors. For each interior grid index i,
the update algorithm is

      Y(i) = X(i - 1) + X(i + 1) - 2 × X(i)

In Fortran 90, the resulting DO loop can be expressed as a
single array assignment. How would this be parallelized? The
simplest way to imagine parallelization would be to partition
the X and Y arrays into equal-size chunks, with one chunk on
each processor. Each iteration could proceed simultaneously,
and at the chunk boundaries, some communication would occur
between processors. The HPF implementation of this idea is
simply to add two data placement statements to the Fortran 90
code. One of these declares that the X array should be
distributed into chunks, or blocks. The other declares that the
Y array should be distributed such that the elements align to
the same processors as the corresponding elements of the X
array. The resultant code for arrays with 1,000 elements is as
follows:

!HPF$ DISTRIBUTE X(BLOCK)
!HPF$ ALIGN Y WITH X
      REAL*8 X(1000), Y(1000)
      <initialize x>
      Y(2:999) = X(1:998) + X(3:1000) - 2 * X(2:999)
      <check the answer>
      END

The HPF compiler is responsible for generating all of the
boundary-element communication code. The compiler is also
responsible for determining the most even distribution of
arrays. (If, for example, there were 13 processors, some chunks
would be bigger than others.)
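
The one-dimensional update can also be run serially as an ordinary Fortran 90 program, since a compiler without HPF support treats the !HPF$ lines as comments. The sketch below fills in the two placeholders with assumed choices: X(i) is initialized to i squared, for which every interior Y(i) must equal 2, and the check compares against that value. The program name, the kind parameter DP, and the tolerance are illustrative assumptions.

```fortran
PROGRAM STENCIL_1D
  IMPLICIT NONE
  INTEGER, PARAMETER :: DP = KIND(1.0D0)   ! double precision kind
  INTEGER, PARAMETER :: N = 1000
  REAL(DP) :: X(N), Y(N)
  INTEGER :: I
  ! Placement directives: plain comments to a non-HPF compiler.
!HPF$ DISTRIBUTE X(BLOCK)
!HPF$ ALIGN Y WITH X

  ! Assumed initialization: X(i) = i**2, so the stencil gives exactly 2.
  DO I = 1, N
    X(I) = REAL(I, DP)**2
  END DO
  Y = 0.0_DP

  ! The array assignment from the text, written with N in place of 1000.
  Y(2:N-1) = X(1:N-2) + X(3:N) - 2 * X(2:N-1)

  ! Assumed check: every interior element should be 2.
  IF (ALL(ABS(Y(2:N-1) - 2.0_DP) < 1.0E-9_DP)) THEN
    PRINT *, 'stencil check passed'
  ELSE
    PRINT *, 'stencil check failed'
  END IF
END PROGRAM STENCIL_1D
```

The same source is what an HPF compiler would distribute across processors; only the directives change its placement, not its meaning.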

This simple example is useful not only as an illustration of
the power of HPF but also as a way of pointing to one of the
hazards of parallel algorithm development. Each of the
element-updates involves three floating-point operations: an
addition, a subtraction, and a multiplication. So, as an
example, on a four-processor system, each processor would
operate on 250 elements with 750 floating-point operations. In
addition, each processor would be required to communicate one
word of data for each of the two chunk boundaries. The time
that each of these communications takes is known as the
communications latency. Typical transmission control
protocol/internet protocol (TCP/IP) network latencies are
twenty thousand times (or more) longer than the time it
typically takes a high-performance system to perform a
floating-point operation. Thus even 750 floating-point
operations are negligible compared with the time taken to
communicate. In the above example, network parallelism would be
a net loss, since the total execution time would be totally
swamped by the network latency.
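
The arithmetic behind this conclusion can be laid out explicitly. The following sketch measures everything in units of one floating-point operation and uses the figures quoted above. It assumes, for simplicity, that the two boundary messages per processor do not overlap with each other or with computation, and that each costs exactly one latency; these simplifications, and all the names, are assumptions made for the example.

```fortran
PROGRAM LATENCY_BUDGET
  IMPLICIT NONE
  INTEGER, PARAMETER :: N_ELEMENTS        = 1000
  INTEGER, PARAMETER :: N_PROCS           = 4
  INTEGER, PARAMETER :: FLOPS_PER_ELEMENT = 3      ! add, subtract, multiply
  INTEGER, PARAMETER :: MESSAGES_PER_PROC = 2      ! one per chunk boundary
  INTEGER, PARAMETER :: LATENCY_IN_FLOPS  = 20000  ! TCP/IP figure from the text
  INTEGER :: SERIAL_COST, COMPUTE_COST, COMM_COST, PARALLEL_COST

  ! One processor does all of the work and never communicates.
  SERIAL_COST = N_ELEMENTS*FLOPS_PER_ELEMENT

  ! Each of the four processors updates its own chunk of 250 elements ...
  COMPUTE_COST = (N_ELEMENTS/N_PROCS)*FLOPS_PER_ELEMENT

  ! ... and then pays one latency per boundary message (assumed serial).
  COMM_COST = MESSAGES_PER_PROC*LATENCY_IN_FLOPS
  PARALLEL_COST = COMPUTE_COST + COMM_COST

  PRINT *, 'serial cost (flop units):        ', SERIAL_COST
  PRINT *, 'parallel compute per processor:  ', COMPUTE_COST
  PRINT *, 'parallel communication:          ', COMM_COST
  PRINT *, 'parallel total per processor:    ', PARALLEL_COST
  IF (PARALLEL_COST > SERIAL_COST) THEN
    PRINT *, 'parallel execution is slower than serial here'
  END IF
END PROGRAM LATENCY_BUDGET
```

With these numbers the serial run costs about 3,000 units while each parallel processor spends 750 units computing and 40,000 units waiting, which is the sense in which the latency swamps the computation.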

Of course, some communication mechanisms are of 
lower latency than TCP/IP networks. As an example, 
Digital's implementation of MEMORY CHANNEL 
cluster interconnect reduces the latency to less than 
1000 floating-point operations (relative to the perfor- 
mance of, say, Digital's AlphaStation 600 5/300 sys- 
tem). For SMP, the latency is even smaller. In both 
cases, there may be a benefit to parallelism. 

A Three-dimensional Red-Black Poisson 
Equation Solver 

The example of a one-dimensional algorithm in the previous
section can be easily generalized to a more realistic
three-dimensional algorithm for solving the Poisson equation
using a relaxation technique commonly known as the red-black
method. The grid is partitioned into two colors, following a
two-dimensional checkerboard arrangement. Each red grid element
is updated based on the values of neighboring black elements. A
similar array assignment can be written as in the previous
example or, as shown in the partial code segment below,
alternatively can use the HPF FORALL construct to express the
assignments in a style similar to that for serial DO loops.

!HPF$ DISTRIBUTE(*,BLOCK,BLOCK) :: U,V
      <other stuff>
      FORALL (I=2:NX-1, J=2:NY-1:2, K=2:NZ-1:2)
        U(I,J,K) = FACTOR*(HSQ*F(I,J,K) + &
                   U(I-1,J,K) + U(I+1,J,K) + &

The distribution directive lays out the array so that the first
dimension is completely contained within a processor, with the
other two dimensions block-distributed across processors in
rectangular chunks. The red-black checkerboarding is performed
along the second and third dimensions. Note also the Fortran 90
free-form syntax employed here, in which the ampersand is used
as an end-of-line continuation statement.

In this example, the parallelism is similar to that of the
one-dimensional finite-difference example. However,
communication now occurs along the two-dimensional boundaries
between blocks. The HPF compiler is responsible for these
communications. Digital's Fortran 90 compiler performs several
optimizations of those communications. First, it packages up
all of the data that must be communicated into long vectors so
that the start-up latency is effectively hidden. Second, the
compiler creates so-called shadow edges (processor-local copies
of nonlocal boundary edges) for the local arrays so as to
minimize the effect of buffering of neighbor values. These
kinds of optimizations can be extremely tedious to
message-passing programmers, and one of the virtues of a
high-level language like HPF is that the compiler can take care
of the bookkeeping. Also, since the compiler can reliably do
buffer-management bookkeeping (for example, ensuring that
communication buffers do not overflow), the communications
runtime library can be optimized to a far greater extent than
one would normally expect from a user-safe message library.
Indeed, Digital's HPF communications are performed using a
proprietary optimized communications library, Digital's
Parallel Software Environment.
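
A complete, serial version of one red-black sweep may help make the FORALL style concrete. The paper shows only part of the update expression, so the sketch below assumes the usual seven-point stencil with FACTOR = 1/6 and HSQ = 1; the grid size, the test data, and the arrays JS and KS that select the checkerboard passes are likewise assumptions made for illustration. The grid is initialized to a linear function with F = 0, which that stencil should leave unchanged, so the printed change should be at rounding level.

```fortran
PROGRAM RED_BLACK_SKETCH
  IMPLICIT NONE
  INTEGER, PARAMETER :: DP = KIND(1.0D0)
  INTEGER, PARAMETER :: NX = 8, NY = 8, NZ = 8      ! assumed grid size
  REAL(DP), PARAMETER :: FACTOR = 1.0_DP/6.0_DP     ! assumed stencil weight
  REAL(DP), PARAMETER :: HSQ = 1.0_DP               ! assumed mesh spacing squared
  ! Starting (J,K) offsets: passes 1-2 cover one color (J+K even),
  ! passes 3-4 cover the other color (J+K odd).
  INTEGER, PARAMETER :: JS(4) = (/ 2, 3, 2, 3 /)
  INTEGER, PARAMETER :: KS(4) = (/ 2, 3, 3, 2 /)
  REAL(DP) :: U(NX,NY,NZ), U0(NX,NY,NZ), F(NX,NY,NZ)
  INTEGER :: I, J, K, PASS
  ! Same layout as in the text; a comment to a non-HPF compiler.
!HPF$ DISTRIBUTE(*,BLOCK,BLOCK) :: U, F

  F = 0.0_DP
  FORALL (I=1:NX, J=1:NY, K=1:NZ) U(I,J,K) = REAL(I + 2*J + 3*K, DP)
  U0 = U

  DO PASS = 1, 4
    ! Within one FORALL every right-hand side is evaluated before any
    ! element is assigned, so the I-neighbors used are the old values.
    FORALL (I=2:NX-1, J=JS(PASS):NY-1:2, K=KS(PASS):NZ-1:2)
      U(I,J,K) = FACTOR*(HSQ*F(I,J,K) +        &
                 U(I-1,J,K) + U(I+1,J,K) +     &
                 U(I,J-1,K) + U(I,J+1,K) +     &
                 U(I,J,K-1) + U(I,J,K+1))
    END FORALL
  END DO

  PRINT *, 'largest change after one sweep:', MAXVAL(ABS(U - U0))
END PROGRAM RED_BLACK_SKETCH
```

Because the checkerboard runs over the second and third dimensions only, the neighbors in those two directions always belong to the other color, while the first dimension stays local to a processor under the distribution shown.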



Communications and SPMD Programming with HPF 

Since HPF can be used to place data, it stands to reason that
communication can be forced between processors. The beauty of
HPF is that all of this can be done in the context of
mathematics rather than in the context of distributed parallel
programming. The code fragment in Figure 1 illustrates how this
is done.

On two processors, the two columns of the U and V 
arrays are each on different processors; thus the array 
assignment causes one of those columns to be moved 
to the other processor. This kind of an operation begins 
to provide programmers with explicit ways to control 
data communication and therefore to more explicitly 
manage the association of data and operations to 
processors. Notice that the programmer need not be 
explicit about the parallelism. In fact, scientists and 
engineers rarely want to express parallelism. In typical 
message-passing programs, the messages often express 
communication of vector and array information. 

However, despite the fervent hopes of programmers, there are
times when a parallel algorithm can be expressed most simply as
a collection of individual instruction streams operating on
local data. This SPMD style of programming can be expressed in
HPF with the EXTRINSIC(HPF_LOCAL) declaration, as illustrated
by continuing the above code segment as shown in Figure 2.

Because the subroutine CFD is declared to be
EXTRINSIC(HPF_LOCAL), the HPF compiler executes that routine
independently on each processor (or more generally, the
execution is done once per peer process), operating on
routine-local data. As for the array argument, V, which is
passed to the CFD routine, each processor operates only on its
local slice of that array. In the specific example above on two
processors, the first one operates on the first column of V and
the second one operates on the second column of V.

It is important to mention here that, although HPF permits, and
even encourages, SPMD programming, the more popular method at
this time is the message-passing technique embodied in, for
example, the PVM⁷ and MPI libraries. These libraries can be
invoked from Fortran, and can also be used in conjunction with
EXTRINSIC(HPF_LOCAL) subroutines.



!HPF$ DISTRIBUTE(*,BLOCK) :: U
!HPF$ ALIGN V WITH U
      REAL*8 U(N,2),V(N,2)
      <initialize arrays>
      V(:,1) = U(:,2)    ! MOVE A VECTOR BETWEEN PROCESSORS

Figure 1

Code Example Showing Control of Data Communication without Expression of Parallelism
      CALL CFD(V)    ! DO LOCAL WORK ON THE LOCAL PART OF V
      <finish the main program>

      EXTRINSIC(HPF_LOCAL) SUBROUTINE CFD(VLOCAL)
      REAL*8, DIMENSION(:,:) :: VLOCAL
!HPF$ DISTRIBUTE *(*,BLOCK) :: VLOCAL
      <do arbitrarily complex work with vlocal>
      END

Figure 2

Code Example of Parallel Algorithm Expressed as Collection of Instruction Streams



Clusters of SMP Systems 

During these last few years of the second millennium, we are
witnessing the emergence of systems that consist of clusters of
shared-memory SMP computers. This exciting development is the
logical result of the exponential increase in performance of
mid-priced ($100K to $1000K) systems.

There are two natural ways of writing parallel Fortran programs
for clusters of SMP systems. The easiest way is to use HPF and
to target the total number of processors. So, for example, if
there were two SMP systems, each with four processors, one
would compile the HPF program for eight processors (more
generally, for eight peers). If the program contained, for
instance, block-distribution directives, the affected arrays
would be split up into eight chunks of contiguous array
sections.
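
The way a block distribution carves an array into contiguous chunks can be tabulated with a few lines of ordinary Fortran. The sketch below assumes the default HPF BLOCK rule, in which the block size is the array extent divided by the number of peers, rounded up; an actual compiler is free to report its layout differently. The program and subroutine names are hypothetical. It prints the layout of a 1,000-element array over eight peers, and also over thirteen peers, where the last chunk comes out smaller than the others.

```fortran
PROGRAM BLOCK_LAYOUT
  IMPLICIT NONE
  CALL SHOW_BLOCKS(1000, 8)      ! two four-processor SMP systems
  CALL SHOW_BLOCKS(1000, 13)     ! an uneven case
CONTAINS
  SUBROUTINE SHOW_BLOCKS(N, NP)
    INTEGER, INTENT(IN) :: N, NP
    INTEGER :: BSIZE, P, LO, HI
    ! Assumed rule: block size is N/NP rounded up.
    BSIZE = (N + NP - 1)/NP
    PRINT '(A,I5,A,I3,A,I4)', ' elements:', N, '  peers:', NP, &
                              '  block size:', BSIZE
    DO P = 1, NP
      LO = (P - 1)*BSIZE + 1
      HI = MIN(P*BSIZE, N)
      IF (LO > N) THEN
        PRINT '(A,I3,A)', '   peer', P, ' owns no elements'
      ELSE
        PRINT '(A,I3,A,I5,A,I5)', '   peer', P, ' owns', LO, ' to', HI
      END IF
    END DO
  END SUBROUTINE SHOW_BLOCKS
END PROGRAM BLOCK_LAYOUT
```

Under this rule each of the eight peers owns 125 consecutive elements, so a stencil update only needs data from the two neighboring peers.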

The second way of writing parallel Fortran programs for
clustered SMP systems is to use HPF to target the total number
of SMP machines and then to use PCF (or more generally,
shared-memory extensions) to achieve parallelism locally on
each of the SMP machines. For example, one might write

!HPF$ DISTRIBUTE (*, BLOCK) :: V
      <stuff>
      EXTRINSIC(HPF_LOCAL) SUBROUTINE CFD(V)
      <stuff>
C*KAP*PARALLEL REGION

If the target system consisted of two SMP systems, each with
four processors, and the above program was compiled for two
peers, then the V array would be distributed into two chunks of
columns, one chunk per SMP system. Then the routine, CFD, would
be executed once per SMP system; and the PCF directives would,
on each system, cause parallelism on four threads of execution.

It is unclear at this time whether there would ever be a
practical reason for using a mix of HPF and PCF extensions. It
might be tempting to think that there would be performance
advantages associated with the local use of shared-memory
parallelism. However, experience has shown that program
performance tends to be restricted by the weakest link in the
performance chain (an observation that has been enshrined as
"Amdahl's Law"). In the case of clustered SMP systems, the weak
link would be the inter-SMP communication and not the intra-SMP
(shared-memory) communication. This casts some doubt on the
worth of local communications optimizations. Experimentation
will be necessary.
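
In its usual quantitative form, Amdahl's Law says that if a fraction p of the run time can be spread over n processors and the rest cannot, the speedup is 1/((1 - p) + p/n). The sketch below simply evaluates that formula. The value p = 0.9 is an assumed figure chosen for illustration, not a measurement of any program discussed here.

```fortran
PROGRAM AMDAHL
  IMPLICIT NONE
  REAL, PARAMETER :: P = 0.9     ! assumed parallelizable fraction
  INTEGER :: N
  REAL :: S

  ! Double the processor count and watch the speedup level off.
  N = 1
  DO WHILE (N <= 64)
    S = 1.0/((1.0 - P) + P/REAL(N))
    PRINT '(A,I3,A,F6.2)', ' processors:', N, '   speedup:', S
    N = 2*N
  END DO

  ! The part that cannot be sped up sets the ceiling.
  PRINT '(A,F6.2)', ' limit for very many processors:', 1.0/(1.0 - P)
END PROGRAM AMDAHL
```

The speedup approaches 1/(1 - p) however many processors are added, which is why attention goes first to the slowest link, here the inter-SMP communication.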

Whatever else one might say about parallelism, one 
thing is certain: The future will not be boring. 

Summary 

Fortran was developed and has continued to evolve as a computer
language that is particularly suited to expressing mathematical
formulas. Among the recent extensions to Fortran are a variety
of constructs for the high-level manipulation of arrays. These
constructs are especially amenable to parallel optimization. In
addition, there are extensions (PCF) for explicit shared-memory
parallelization and also data-parallel extensions (HPF) for
cluster parallelism. The Digital Fortran compiler performs many
interesting optimizations of codes written using HPF. These HPF
codes are able to hide, without sacrificing performance, much
of the tedium that otherwise accompanies cluster programming.
Today, the most exciting frontier for Fortran is that of SMP
clusters and other nonuniform-memory-access (NUMA) systems.

Digital Equipment Corporation and Oracle
Corporation have announced a new TPC-C 
performance record in the competitive mar- 
ket for database applications and UNIX ser- 
vers on the AlphaServer 8400 5/350 four-node 
TruCluster system. A performance evaluation 
strategy enabled Digital to achieve record- 
setting performance for this TruCluster con- 
figuration supporting the Oracle Parallel Server 
database application under the TPC-C workload. 
The system performance in this environment is 
a result of tuning the system under test and tak- 
ing advantage of TruCluster features such as the 
MEMORY CHANNEL interconnect and Digital's 
distributed lock manager and distributed raw 
disk service. 



Current industry trends have moved from centralized 
computing offered by uniprocessors and symmetric 
multiprocessing (SMP) systems to multinode, highly
available and scalable systems, called clusters. The 
TruCluster multicomputer system for the Digital 
UNIX environment is the latest cluster product from 
Digital Equipment Corporation. 1 In this paper, we 
discuss our test and results on a four-node AlphaServer 
8400 5/350 TruCluster configuration supporting the 
Oracle Parallel Server database application. We evalu- 
ate this system under the Transaction Processing 
Performance Council's TPC-C benchmark to provide 
performance results in the competitive market for 
database applications. 

The TPC-C benchmark is a medium-complexity, 
on-line transaction processing (OLTP) workload.' 5 It 
is based on an order-entry workload, with different 
transaction types ranging from simple transactions to 
medium-complexity transactions that have 2 to 50 
times the number of calls of a simple transaction. 4 To 
run the TPC-C benchmark on a clustered system, the 
operating system and the database engine must present 
a single database to the benchmark client. Thus the 
TruCluster system running the Oracle Parallel Server 
differs greatly from a network-based cluster system by 
two significant features. First, the Digital UNIX distributed
raw disk (DRD) service enables the distributed
Oracle Parallel Server to access all raw disk volumes
regardless of their physical location in the cluster.
Second, the Oracle Parallel Server uses Digital's distributed
lock manager (DLM) to synchronize all access to
shared resources (such as in-memory cache blocks or
disk blocks) across a TruCluster system.

In tuning the system under test, we used the DRD
and the DLM services to balance the database across
the TruCluster multicomputer system. The configuration
includes a specialized peripheral component
interconnect (PCI) known as the MEMORY
CHANNEL interconnect to greatly improve the bandwidth
and latency between two or more member
nodes. We tuned the system under test to attain the
peak bandwidth of 100 megabytes per second (MB/s)
for heavy internode communication during checkpointing
by using a dedicated PCI bus for the
MEMORY CHANNEL interconnect. We also tuned
the system under test to use the very large memory 
technology and trade off memory for the database 
cache with memory for DLM locks to improve the 
throughput. (For a discussion of this technology, see 
the section Performance Evaluation Methodology.)
We measured the maximum throughput, the 90th 
percentile response time for each transaction type, and 
the keying and think times. Finally, we compared our 
measured throughput and price/performance with 
competitive vendors like Tandem Computers and 
Hewlett-Packard Company. 

The rest of the paper is organized as follows. In the 
next section, we provide a synopsis of the TruCluster
technology and introduce the Oracle Parallel Server,
an optional Oracle product that enables the user to use
TruCluster technology with the Oracle relational
database management system. Following that, we give 
an overview of the TPC-C benchmark. Next, we 
describe the system under test and our performance 
evaluation methodology. Then we discuss our perfor- 
mance measurement results and compare them with 
competitive vendor results. Finally, we present our 
concluding remarks and discuss our future work. 

TruCluster Clustering Technology 

Digital's TruCluster configuration consists of inter- 
connected computers (uniprocessors or SMPs) and 
external disks connected to one or more shared, small 
computer systems interface (SCSI) buses providing 
services to clients.' 1 It presents a single raw volume 
namespace to a client with better application availabil- 
ity than a single system and better scalability than an 
SMP. A TruCluster configuration supports highly parallelized
database managers, such as the Oracle Parallel
Server, to provide incremental performance scaling
of at least 80 percent for transaction processing applications.
The underlying technology to provide this
incremental growth includes a PCI-based MEMORY
CHANNEL interconnect for communication between
cluster members. The MEMORY CHANNEL
interconnect provides a 100-MB/s, memory-mapped
connection between cluster members. 7 The cluster
members map transfers from the MEMORY
CHANNEL interconnect into their memory using
standard memory access instructions. The use of
memory store instructions rather than special I/O
instructions provides low latency (two microseconds)
and low overhead for a transfer of any length.

The TruCluster for Digital UNIX product supports 
up to eight (four for commercial DLM/DRD-based 
applications) cluster members connected to a com- 
mon cluster interconnect. The computer systems 
supported within a cluster are AlphaServer systems of 
varying processor speed and number of processors. 
The member systems run applications (for example, 
user applications), as well as monitor the state of each 
member system, each shared disk, the MEMORY 
CHANNEL interconnect, and the network. These 
cluster members communicate over the MEMORY 
CHANNEL interconnect/' 8 A MEMORY CHANNEL 
configuration consists of a MEMORY CHANNEL 
adapter installed in a PCI slot and link cables to con- 
nect the adapters. In a configuration with more than 
two members, the MEMORY CHANNEL adapters 
are connected to a MEMORY CHANNEL hub. A 
typical TruCluster configuration with a MEMORY
CHANNEL hub is shown in Figure 1. 

Figure 1

A TruCluster Configuration with MEMORY CHANNEL Hub

Applications can attain high availability by connecting
two or more member systems to one or more
shared SCSI buses, thus constructing an Available
Server Environment (ASE). A shared SCSI bus is
required only for two-member configurations that do
not have a MEMORY CHANNEL hub. Although
MEMORY CHANNEL is the only supported cluster
interconnect, Ethernet and fiber distributed data
interface (FDDI) are supported for connecting clients
to cluster members. Disks are connected either locally
(i.e., nonshared) to a SCSI bus or to a shared SCSI bus
between two or more member systems. A single node
in the cluster is used to serve the disk to other cluster
members. Disks on local buses therefore become
unavailable upon failure of the server node. The SCSI
controller supported in this configuration is the PCI
disk adapter, KZPSA.

The distinguishing feature of the TruCluster
software is its support of the MEMORY CHANNEL
as a cluster interconnect, thus providing industry-leadership
performance to intracluster communication.
The TruCluster software includes the following
components: the DLM, the connection manager, the
DRD, and the cluster communication service. The
DLM synchronizes access to shared resources for
all member systems in a cluster by means of a run-time
library. Cooperating processes use the DLM to synchronize
access to a shared resource, a DRD device,
a file, or a program. The DLM service is primarily used
by the Oracle Parallel Server to coordinate access to the
cache and shared disks that have the database installed.
The connection manager maintains information about
the cluster configuration and maintains a communication
path between each cluster member for use by the
DLM. The DLM uses this configuration data and other
connection manager services to maintain a distributed
lock database. The DRD allows the exporting of clusterwide
raw devices. This allows disk-based user-level
applications to run within the cluster, regardless of
where in the cluster the actual physical storage resides.
Therefore a DRD service allows the Oracle Parallel
Server parallel access to storage media from multiple
cluster members. The cluster communication service is
used to allocate the MEMORY CHANNEL address 
space and map it to the processor main memory. 

TPC-C Benchmark 

The TPC-C benchmark depicts the activity of a generic 
wholesale supplier company. The hierarchy 
in the TPC-C business environment is shown in 
Figure 2. The company consists of a number of geo- 
graphically distributed sales districts and associated 
warehouses. Further, there are 10 districts under each 
warehouse with each district serving 3,000 (3K) customers.
All the warehouses maintain a stock of 100,000
items sold by the company. As the company grows,
new warehouses and associated sales districts are created.
The business activity consists of customer calls
to place new orders or request the status of existing
orders, payment entries, processing orders for delivery,
and stock-level examination. The orders on an average
are composed of 10 order lines (i.e., line items).
Ninety-nine percent of all orders can be met by a local
warehouse, and only one percent of them need to be
sold by a remote warehouse.

The TPC-C logical database components consist of
nine tables. Figure 3 shows the relationship between
these tables, the cardinality of the tables (i.e., the number
of rows), and the cardinality of the relationships.
The figure also shows the approximate row length in
bytes for each table and the table size in megabytes.
The cardinality of all the tables, except the item table,
grows with the number of warehouses. The order,
order-line, and history tables grow indefinitely as the
orders are processed.
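
As a rough illustration of how the table cardinalities scale with the number of warehouses W, the following Python sketch applies the per-warehouse ratios described above (10 districts per warehouse, 3,000 customers per district) together with the 100K item and stock figures shown in Figure 3. The function name is hypothetical, and the warehouse count of 2,592 is the one used for the test reported later in this paper.

```python
def initial_cardinalities(warehouses):
    """Initial row counts for the tables whose size is a fixed multiple of W."""
    w = warehouses
    return {
        "warehouse": w,
        "district": 10 * w,               # 10 districts per warehouse
        "customer": 10 * 3_000 * w,       # 3,000 customers per district
        "stock": 100_000 * w,             # one stock row per item per warehouse
        "item": 100_000,                  # does not grow with W
    }


rows = initial_cardinalities(2592)
for table, count in rows.items():
    print(f"{table:10s} {count:>13,d}")
```

Only the item table is independent of W, which is why it is the one table whose cardinality does not grow as warehouses are added.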

The five types of TPC-C transactions are listed in 
Table 1. The new-order transaction places an order
(of 10 order lines) from a warehouse through a single
database transaction; it inserts the order and updates
the corresponding stock level for each item. Ninety-nine
percent of the time the supplying warehouse is



[Figure 2 diagram: COMPANY at the top; WAREHOUSE 1 through WAREHOUSE W beneath it; DISTRICT 1 through DISTRICT 10 under each warehouse; CUSTOMER entries (up to 3K) under each district.]

Source: Transaction Processing Performance Council, TPC Benchmark C Standard Specification,
Revision 3.0, February 1995.

Figure 2

Hierarchical Relationship in the TPC-C Business Environment

[Figure 3 diagram. Key: each box gives the table name, its cardinality (rows), its row length (bytes), and its table size (megabytes). Recoverable labels:

WAREHOUSE    W         89     0.000089·W
DISTRICT     W·10      95     0.00095·W
CUSTOMER     W·30K     655    19.65·W
STOCK        W·100K    306    30.6·W
ITEM         100K      82     8.2
ORDERS       W·30K+    24     0.72·W+

The values for the HISTORY, NEW-ORDER, and ORDER-LINE boxes are not legible in this copy. Relationship cardinalities shown on the connecting lines: 100K, W, 10, 3K, 3+.]

Note: + implies variations over measurement interval as rows are deleted or added.

Figure 3

TPC-C Database Tables Relationship



Table 1

TPC-C Requirements for Percentage in Mix, Keying Time, Response Time, and Think Time a

                                            90th Percentile   Minimum Mean
                Minimum      Minimum        Response Time     Think Time
Transaction     Percentage   Keying Time    Constraint        Distribution
Type            in Mix       (Seconds)      (Seconds)         (Seconds)

New order       N/A b        18             5                 12
Payment         43           3              5                 12
Order status    4            2              5                 10
Delivery        4            2              5                 5
Stock level     4            2              5                 5

Notes

a Table 1 is published in the Transaction Processing Performance Council's TPC Benchmark C Standard Specification, Revision 3.0, February 1995.
b Not applicable (N/A) because the measured rate is the reported throughput, though it is desirable to set it as high as possible (45%).



the local warehouse, and only one percent of the time 
is it a remote warehouse. The payment transaction 
processes a payment for a customer, updates the cus- 
tomer's balance, and reflects the payment in the dis- 
trict and warehouse sales statistics. The customer 
resident warehouse is the local warehouse 85 percent 
of the time and is the remote warehouse 15 percent of
the time. The order-status transaction returns the status
of a customer order. The customer order is selected
60 percent of the time by last name and 40 percent of
the time by identification number. The delivery transaction
processes 10 pending orders, one for each district,
with 10 items per order. The
corresponding entry in the new-order table is also
deleted. The delivery transaction is intended to be executed
in deferred mode through a queuing mechanism,
rather than being executed interactively; there is
no terminal response indicating the transaction completion.
The stock-level transaction examines the
quantity of stock for the items ordered by each of the
last 20 orders in a district and determines the items
that have a stock level below a specified threshold.

The TPC-C performance metric measures the total
number of new orders completed per minute, with a
90th percentile response-time constraint of 5 seconds.
This metric measures the business throughput rather
than the transaction execution rate. It is expressed in
transactions-per-minute C (tpmC). The metric implicitly
takes into account all the transaction types as their
individual throughputs are controlled by the mix percentage
given in Table 1. The tpmC is also driven by
the activity of emulated users and the frequency of
checkpointing. The cycle for generating a TPC-C
transaction by an emulated user is shown in Figure 4.

The transactions are generated uniformly and at 
random while maintaining a minimum percentage in 
mix for each transaction type. Table 1 gives the minimum
mix percentage for each transaction type, the
minimum keying time, the maximum 90th percentile
response-time constraint, and the minimum think
time defined by the TPC-C specification.
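
One way to picture random generation that still honors the minimum mix is a seeded, weighted draw. The Python sketch below is an illustration only, not the mechanism used by the RTE software: the function names and the 0.1-point safety margin above each minimum are assumptions, and whatever share is left over after the four minimums goes to the new-order transaction.

```python
import random

# Minimum percentage in mix from Table 1 (new order has no stated minimum).
MIN_MIX = {"payment": 43.0, "order_status": 4.0, "delivery": 4.0, "stock_level": 4.0}


def make_weights(min_mix, margin=0.1):
    """Aim slightly above each minimum; new order takes the remainder."""
    weights = {name: pct + margin for name, pct in min_mix.items()}
    weights["new_order"] = 100.0 - sum(weights.values())
    return weights


def generate(n, seed, weights):
    """Draw n transaction types at random with a per-user seed."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


weights = make_weights(MIN_MIX)
sample = generate(100_000, seed=1, weights=weights)
share = {k: 100.0 * sample.count(k) / len(sample) for k in weights}
print(weights["new_order"], share)
```

Giving each emulated user a different seed keeps the users from issuing the same sequence of transaction types while the overall percentages stay close to the targets.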

The delivery transaction, unlike the other transactions,
must be executed in a deferred mode. The
response time in Table 1 is the terminal response 
acknowledging that the transaction has been queued 
and not that the delivery transaction itself has been 
executed. Further, at least 90 percent of the deferred 
delivery transactions must complete within 80 seconds 
of their being queued for execution. The performance 
tuning for the system under test determines the 
number of checkpoints done in the measurement 
interval and the length of the checkpointing interval.
The TPC-C specification, however, defines the
upper bound on the checkpointing interval to be
30 minutes.

The other TPC-C metric is the price/performance
ratio or dollars per tpmC. This metric is computed by
dividing the total five-year system cost for the system
under test by the reported tpmC.
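
The price/performance arithmetic can be written out as a short Python sketch. The function names are hypothetical. The inputs are the rounded figures reported later in this paper (30,390.65 tpmC at $305 per tpmC), so the implied five-year cost is only an approximation derived here, not a number stated in the paper.

```python
def dollars_per_tpmc(five_year_cost, tpmc):
    """Price/performance: total five-year system cost divided by tpmC."""
    return five_year_cost / tpmc


def implied_five_year_cost(price_performance, tpmc):
    """Invert the ratio to estimate the cost behind a published figure."""
    return price_performance * tpmc


cost = implied_five_year_cost(305.0, 30_390.65)
print(f"implied five-year cost: about ${cost:,.0f}")
print(f"check: ${dollars_per_tpmc(cost, 30_390.65):.0f} per tpmC")
```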

Performance Evaluation Methodology 

In this section, we first describe the configuration of
the system under test (SUT) used for the performance
evaluation of the TruCluster system under the TPC-C
workload. Then we discuss the testing strategy used to
enhance the performance of the SUT.

Figure 4

Cycle for Generating a TPC-C Transaction by an Emulated User

We show the configuration of the client-server SUT
in Figure 5. The server SUT consists of a TruCluster
configuration with four nodes; each node is an
AlphaServer 8400 5/350 system with eight 350-megahertz
(MHz) CPUs and 8 gigabytes (GB) of
memory. These nodes are connected together by a
MEMORY CHANNEL link cable from the MEMORY
CHANNEL adapter on the node to a single MEMORY
CHANNEL hub. The local storage configuration for
each node consists of 6 HSZ40 redundant array of
inexpensive disks (RAID) controllers, 31 RZ28 and
141 RZ29 disk drives, connected to the node by SCSI
buses to 6 KZPSA disk adapters. Further, each node is
connected to FDDI by a DEFPA FDDI adapter. The
nodes communicate with the clients over this FDDI.

The client SUT consists of 16 AlphaServer 1000
4/266 systems, each with 512 MB of memory, one
RZ28 disk drive, and one DEFPA FDDI adapter. 12 The
remote terminal emulators (RTEs) that are used to generate
the transactions and measure the various times
(i.e., think, response, or keying time) for each transaction
are 16 VAXstation 3100 workstations, each with
one RZ28 disk drive. From our logical description of
the network topology shown in Figure 6, we see that
each of the four nodes in the cluster is connected to four
client systems, and each RTE is connected to one client
system. The four clients associated with each node are
connected to a DEChub 900 switch. Each of the four
DEChub 900 products contains two concentrators,
one DEFHU-MU 14-port unshielded twisted-pair
(UTP) concentrator (for FDDI) and one DEFHU-MH
concentrator (for the twisted-pair Ethernet). The
DEChub 900 switches are connected to an 8-port
GIGAswitch system, which is used to route communications
between the client and the server.

The software configuration of the server system is
the TruCluster software running under the Digital
UNIX version 4.0A operating system and the Oracle
Parallel Server database manager (Oracle7 version 7.3)
installed on each cluster member. The software configuration
installed on each client system is the Digital
UNIX version 3.2D operating system and the BEA
Tuxedo System/T version 4.2 transaction processing
monitor. Further, each RTE runs the OpenVMS operating
system and a proprietary emulation package,
VAX RTE. In the remainder of this section, we discuss
the testing strategy used to generate the transactions
on the front end. Then we discuss the tuning done on
the back end to achieve the maximum possible tpmC
measurements from the SUT.

[Figure 4 diagram, boxes in the cycle: DISPLAY SCREEN; THINK TIME: while screen remains displayed; KEY TIME: to enter required input field; MEASURE TRANSACTION RESPONSE TIME; DISPLAY DATA.]

[Figure 5 diagram: four 8-CPU, 8-GB AlphaServer 8400 5/350 systems, each with 6 HSZ40 RAID controllers and 31 RZ28 and 141 RZ29 disk drives, connected to a MEMORY CHANNEL hub; FDDI links to four DEChub 900 switches and a GIGAswitch system; 16 AlphaServer 1000 4/266 systems; four groups of 4 VAXstation 3100 workstations.]

Figure 5

Client-Server System under Test

In conformance with the TPC-C specification, we
used a series of RTEs to drive the SUT. The one-to-one
correspondence between emulated users on the
RTE and the TPC client forms on the client required
us to determine the maximum number of users to be
generated by the RTE. The main factor we used to
determine the number of users was the client's memory
size. We assumed that on a client, 32 MB of memory
is used for the operating system and 0.25 MB for
each TPC client form process. Therefore, with these
constraints, each RTE generates 1,620 emulated users.
The emulated users then generate transactions randomly
based on the predefined transaction mix (as
described in Table 1) with a unique seed. This ensures
the mix is well defined and a variety of transaction
types are running concurrently (to better simulate a
real-world environment). We had a local area transport
(LAT) connection over Ethernet between each
emulated user and a corresponding TPC client form
process on the client for faster communication. We
show the communication between an RTE, a client,
and a server in Figure 7.
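
The memory constraint on the number of TPC client form processes can be checked with a few lines of Python. The paper does not spell out its arithmetic, so this is a sketch under the stated assumptions only (512 MB per client, 32 MB for the operating system, 0.25 MB per form process); the function name is hypothetical. Under those figures the memory-only upper bound is higher than the 1,620 users actually emulated, and 1,620 matches the 25,920 emulated users reported later spread evenly over 16 RTEs.

```python
def max_forms_by_memory(total_mb, os_mb=32.0, per_form_mb=0.25):
    """Upper bound on client form processes that fit in client memory."""
    return int((total_mb - os_mb) // per_form_mb)


bound = max_forms_by_memory(512.0)   # memory-only upper bound per client
users_per_rte = 25_920 // 16         # emulated users spread over 16 RTEs
memory_used_mb = 32.0 + 0.25 * users_per_rte

print(bound, users_per_rte, memory_used_mb)
```

The gap between the bound and the chosen value leaves memory for the other processes that run on the client, such as the Tuxedo servers.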

We built five order queues on each client, one for
each transaction type, which allowed us to
control the transaction percentage mix. A TPC client
form process queues transactions generated by the
emulated users to the appropriate order queue using
Tuxedo library calls. These transaction requests in
each queue are processed in a first in, first out (FIFO)
order by the Tuxedo server processes running on the
client. We had 44 Tuxedo server processes that were
not evenly distributed among the 5 order queues but
were distributed so that the number of Tuxedo server
processes dedicated to a queue was directly correlated
to the percentage of the workload handled by the
queue. In other words, the greater the percentage of
the workload on a queue, the greater the number
of Tuxedo server processes dedicated to that queue.
The number of Tuxedo server processes per client is
computed based on the rule of thumb that each queue
should have no more than 300 outstanding requests
during checkpointing and 15 at other times. These
Tuxedo server processes communicate with the server
system (cluster node) using the Transmission Control
Protocol/Internet Protocol (TCP/IP) over FDDI to
execute related database operations.

[Figure 7 diagram annotations: Each emulated user on the RTE uses a different seed so all clients are not executing the mix in the same order. There is a one-to-one relationship between emulated users and TPC Client Forms. For this test, 1,620 users were emulated on each RTE. This number, however, is dependent on the amount of memory on the client.]

Figure 7

Communication between an RTE, a Client, and a Server
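
The paper does not give the actual number of Tuxedo server processes assigned to each queue. Purely as an illustration of a proportional split, the Python sketch below apportions 44 server processes among five queues using a largest-remainder rule with at least one server per queue. The rule, the function name, and the use of the Table 1 mix (with new order at 45 percent) as the workload share are all assumptions.

```python
import math


def allocate_servers(total, workload_pct):
    """Largest-remainder apportionment with at least one server per queue."""
    scale = total / sum(workload_pct.values())
    quotas = {q: pct * scale for q, pct in workload_pct.items()}
    alloc = {q: max(1, math.floor(v)) for q, v in quotas.items()}
    leftover = max(total - sum(alloc.values()), 0)
    # Hand the remaining servers to the queues with the largest fractions.
    by_remainder = sorted(
        quotas, key=lambda q: quotas[q] - math.floor(quotas[q]), reverse=True
    )
    for q in by_remainder[:leftover]:
        alloc[q] += 1
    return alloc


mix = {"new_order": 45.0, "payment": 43.0, "order_status": 4.0,
       "delivery": 4.0, "stock_level": 4.0}
alloc = allocate_servers(44, mix)
print(alloc)
```

A real configuration would also have to respect the queue-depth rule of thumb quoted above, which this sketch does not model.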

The industry-accepted method of tuning the TPC-C
back end is to add enough disks and disk controllers on
the server to eliminate the potential for an I/O bottleneck,
thus forcing the CPU to be saturated. Once the
engineers are assured that the performance limitation is
CPU saturation, the amount of memory is tuned to
improve the database hit ratio. Because all vendors submitting
TPC-C results use this style of tuning, the performance
limitation for TPC-C is usually the back-end
server's CPU power. In fact, tests have shown that if
this method of tuning is not followed on the back-end
server, the user will not obtain the optimal TPC-C performance
results. Instead, the tests reveal a back-end
server configuration that has not fully utilized the
server's potential by having unbalanced CPU and I/O
capabilities. This type of configuration not only reduces
the server's throughput capacity but also adversely
affects the price/performance of the SUT.

On the back end, we used TruCluster technology 
features to achieve the maximum possible transactions 
per minute (tpm). We balanced the I/O across all the
RAID controllers and disks of the cluster and distrib- 
uted the database across all the server nodes. We dis- 
tributed the database such that each node in the 
cluster had an almost equal part of the database. The 
TPC-C benchmark execution requires a single data- 
base view across the cluster. We used the DRD and 
DLM services of the TruCluster software to present 
a contiguous view of the database across the cluster. If 
both the database and the indexes could have been 
completely partitioned, we could have achieved close 
to linear scaling per node. However, since the Oracle
Parallel Server does not have horizontal partitioning
of the indexes, we could not completely partition the
indexes across the cluster. 15 This resulted in 15 percent
to 20 percent of internodal access, which means that
15 percent to 20 percent of the new orders were satisfied
by remote warehouses, therefore making our
TPC-C results more realistic.

We also tuned the physical memory to trade off
memory for database cache and the DLM locks.
Heuristically, we observed a 40-percent gain in
throughput on a single-node AlphaServer 8400 5/350
server system running TPC-C when the memory size
was increased from 2 GB to 8 GB. This is because, with
more data being served by memory, the number of
processor stalls decreases, and the database-cache hit
ratio improves from 88 percent to more than 95 percent.
Tuning physical memory beyond 2 GB is called
very large memory (VLM). We used the tpm results of
the AlphaServer 8400 system to tune the physical
memory size and configuration. We show these measured
tpm results for the AlphaServer 8400 cluster
systems in Figure 8.

To achieve optimal server performance, it is important
to tune the amount of memory used by the Oracle
System Global Area (SGA) and the DLM. Our testing
found that using VLM to increase the size of the SGA to
5.0 GB of physical memory yielded optimal performance
in a TruCluster environment. However, it is
important to note that on a single-node server that does
not run the Oracle Parallel Server, we could assign 6.6
GB of physical memory to the SGA. (One reason that
the SGA was smaller in an Oracle Parallel Server environment
is that memory needed to be set aside for the
DLM.) Consequently, as seen in Figure 8, the tpm on
a single-node cluster system running the Oracle Parallel
Server (84K tpm) is less than a single-node cluster not
running the Oracle Parallel Server (114K tpm).

In an Oracle Parallel Server environment, we
assigned 1 GB of memory to the DLM for the following
reasons: The DLM, under the 64-bit Digital UNIX
operating system, requires 256 bytes for each lock. In
addition, the DLM must be able to hold at least one
other location (and sometimes three) for lock callback.
As a result, each lock requires between 512 bytes
and 1 kilobyte (KB) of physical memory. To tune the
system, we added more locks to increase the granularity
of the locks and reduce lock contention. We
observed that for this configuration, a system of this
size supporting the Oracle Parallel Server requires
1 million locks (occupying 1 GB of memory) for the
DLM when using 5.0 GB of memory for the SGA.
Again heuristically, we observed that if we used less
memory for the DLM, the total number of locks per
page was reduced. The decrease in locks per page
increases contention across nodes and hence reduces
the tpm as the number of nodes increases.
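
The DLM memory estimate above follows from simple multiplication, sketched here in Python with a hypothetical function name. The per-lock sizes (512 bytes to 1 KB) and the lock count (1 million) are the figures quoted in the text; the worst-case result is what the text rounds to 1 GB.

```python
def dlm_memory_bytes(n_locks, bytes_per_lock):
    """Physical memory needed to hold n_locks at a given per-lock size."""
    return n_locks * bytes_per_lock


N_LOCKS = 1_000_000
low = dlm_memory_bytes(N_LOCKS, 512)     # one extra location per lock
high = dlm_memory_bytes(N_LOCKS, 1024)   # three extra locations per lock

print(f"{low / 2**30:.2f} to {high / 2**30:.2f} GB for {N_LOCKS:,} locks")
```

Sizing the DLM for the upper end of the range explains why about 1 GB was set aside even though a lock can take as little as 512 bytes.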

With the help of engineers from Digital's
MEMORY CHANNEL Group, we were able to use a
hardware data analyzer to measure the percentage of
the MEMORY CHANNEL interconnect's bandwidth
used when running the TPC-C benchmark. By using
the data analyzer, we determined that we do not
approach saturation of the PCI-based MEMORY
CHANNEL hardware during a TPC-C test, even
though it is capable of sustaining a peak throughput
rate of 100 MB/s. In fact, we observed that the
MEMORY CHANNEL bandwidth was not saturated;
a TPC-C test required a peak throughput rate of
only 15 MB/s to 17 MB/s from the MEMORY
CHANNEL. As stated previously, the benchmark
specification forces 15 percent of the database accesses
to be remote, resulting in database accesses across the
MEMORY CHANNEL. Using the DRD administration
service available with the UNIX TruCluster software,
we measured DRD remote read percentages
that matched the 15-percent remote access rate. The
DRD remote write performance was only 3 percent to
4 percent during the steady state and rose to 10 percent
to 11 percent during a database checkpoint. It is
important to note that the TPC-C benchmark performs
random 2K I/Os using the Oracle Parallel
Server. Small, random I/O transfers are much more
difficult to perform than large, sequential transfers.
Because the MEMORY CHANNEL interconnect not
only has sufficient bandwidth for TPC-C but also provides
excellent latency (less than 5 microseconds), we
are able to report very good scaling results.

Figure 8

TPC-C Results on the AlphaServer 8400 Family
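
To put the measured MEMORY CHANNEL load in proportion, the Python sketch below expresses the observed peak rates as a fraction of the 100-MB/s peak quoted above. The function name is hypothetical and the calculation is a simple ratio, not a model of the interconnect.

```python
def utilization(observed_mb_s, peak_mb_s=100.0):
    """Fraction of the interconnect's peak bandwidth actually used."""
    return observed_mb_s / peak_mb_s


low, high = utilization(15.0), utilization(17.0)
headroom = 1.0 - high

print(f"utilization {low:.0%} to {high:.0%}, headroom at least {headroom:.0%}")
```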

In the section TPC-C Benchmark, we discussed that 
the time taken for a checkpoint impacts the through- 
put. Therefore, we focused on improving the check- 
pointing time to increase the tpmC number. First, 
we used a dedicated PCI bus on each node for the 
MEMORY CHANNEL interconnect and thus 
obtained a 5-percent improvement in performance
during checkpointing. Next we implemented the
highly optimized "fastcopy" routine in DRD, which
packs data on the PCI when transmitting through the 
MEMORY CHANNEL interconnect. 

Performance Measurement Results 

In this section, we present our results for the
TruCluster configuration running the TPC-C workload
and compare them with results from competitive
vendors. We conducted the test on a database with
2,592 warehouses and 25,920 emulated users. The
database was equally divided, which means each node
contained 648 warehouses and served 6,480 emulated
users. We show the initial cardinality of the database
tables in Table 2. The cardinality of the history, orders,
new-order, and order-line tables increased as the test
progressed and generated new orders. We conducted
the experimental runs for a minimum of 160 minutes.
The measurement on the SUT began approximately
3 minutes after the simulated users had begun
executing transactions. The measurement period of
120 minutes, however, started after the SUT attained a
steady state in approximately 30 minutes. In agreement
with the TPC-C specification, we performed
4 checkpoints at 30-minute intervals during the measurement
period.
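
The per-node figures and the checkpoint count quoted above are easy to cross-check. The Python sketch below uses only numbers from this section; the constant names are hypothetical, and the ratio of 10 emulated users per warehouse is derived here from 25,920 users and 2,592 warehouses rather than stated in the text.

```python
WAREHOUSES = 2_592
EMULATED_USERS = 25_920
NODES = 4
MEASUREMENT_MINUTES = 120
CHECKPOINT_INTERVAL_MINUTES = 30

warehouses_per_node = WAREHOUSES // NODES
users_per_node = EMULATED_USERS // NODES
users_per_warehouse = EMULATED_USERS // WAREHOUSES
checkpoints = MEASUREMENT_MINUTES // CHECKPOINT_INTERVAL_MINUTES

print(warehouses_per_node, users_per_node, users_per_warehouse, checkpoints)
```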

On the SUT, we measured a maximum throughput
of 30,390.65 tpmC, which set a new record high
in the competitive market for database applications
and UNIX servers. We repeated the experiment once
more to ensure the reproducibility of the maximum
measured tpmC. Digital Equipment Corporation and
Oracle Corporation also present a price/performance
ratio of $305 per tpmC.

Table 2

Initial Cardinality of the Database Tables

Warehouse             2,592
District             25,920
Customer         77,760,000
History          77,760,000
Order            77,760,000
New order        23,328,000
Order lines     777,547,308
Stock           259,200,000
Item                100,000

In Table 3, we present the total occurrences of each
transaction type and the percentage transaction mix
used to generate the transactions in each test run. We
compare the percentage transaction mix in Table 1
and Table 3 and observe that our measurements are in
agreement with the TPC-C specification. We present
the 90th percentile response time measured for each
transaction type in Table 4. The 90th percentile
response time we measured is well below the TPC-C
specification requirement (compare Table 1 and Table
4). In Table 5, we present the minimum, average, and
maximum keying and think times. Again, we comply
with the TPC-C specification (compare Table 1 and
Table 5).
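
The percentage mix in Table 3 follows directly from the occurrence counts. As a minimal sketch (the function names are hypothetical and the tolerance is an arbitrary choice), the calculation can be written in C as:

```c
#include <assert.h>
#include <stddef.h>

/* Percentage share of counts[index] in the total of all n counts. */
double mix_percentage(const long counts[], size_t n, size_t index)
{
    long total = 0;
    size_t i;
    assert(n > 0 && index < n);
    for (i = 0; i < n; i++)
        total += counts[i];
    assert(total > 0);
    return 100.0 * (double)counts[index] / (double)total;
}

/* True when two percentages agree to within the given tolerance. */
int close_to(double a, double b, double tol)
{
    double d = a - b;
    if (d < 0)
        d = -d;
    return d <= tol;
}
```

Applied to the five counts in Table 3 (8,196,755 transactions in total), the new-order share works out to about 44.47 percent and the payment share to about 43.19 percent, matching the table to two decimal places.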

Now we compare the maximum throughput
achieved on the AlphaServer 8400 5/350 four-node
TruCluster configuration with results from Tandem



Table 3

Measured Total Occurrences of Each Transaction Type
and Percentage Transaction Mix

Transaction     Total          Percentage
Type            Occurrences    in Mix
New order       3,645,228      44.47
Payment         3,540,119      43.19
Order status      336,255       4.10
Delivery          337,423       4.12
Stock level       337,730       4.12



Table 4

Measured 90th Percentile Response Time

Transaction               90th Percentile
Type                      Response Time
New order                 3.4
Payment                   3.2
Order status              0.9
Delivery (interactive)    0.4
Delivery (deferred)       5.0
Stock level               1.7



Table 5

Measured Keying/Think Times

Transaction     Minimum      Average      Maximum
Type            (Seconds)    (Seconds)    (Seconds)
New order       18.0/0.00    18.1/12.2    18.8/188.1
Payment         3.0/0.00     3.1/12.1     3.7/201.4
Order status    2.0/0.00     2.1/10.1     2.7/125.6
Delivery        2.0/0.00     2.1/5.2      2.7/74.9
Stock level     2.0/0.00     2.1/5.2      2.7/62.7



Computers and from Hewlett-Packard Company
(HP).' 7 The Tandem NonStop Himalaya K10000-112
112-node cluster reported 20,918.03 tpmC at $1,532
per tpmC. Observe that Digital's measured tpmC
are 45 percent higher than Tandem's, and Digital's
price/performance is 20 percent of Tandem's cost.
In Figure 9, we compare Digital's performance with
HP's. The HP 9000 EPS 30 C/S 48-CPU four-node
cluster system using the Oracle Parallel Server Oracle7
version 7.3 reported 17,826.50 tpmC at $396.' s
Again, observe that the tpmC we measured on
Digital's TruCluster configuration are 59 percent
higher than HP's at 77 percent of the cost.

Conclusion and Future Work 

In this paper, we discussed the performance evaluation
of Digital's TruCluster multicomputer system, specifi-
cally the AlphaServer 8400 5/350 32-CPU, four-
node cluster system, under the TPC-C workload.
For completeness, we gave an overview of TruCluster
clustering technology and the TPC-C benchmark. We
discussed tuning strategies that took advantage of
TruCluster technology features like the MEMORY
CHANNEL interconnect, the DRD, and the DLM.
We tuned memory to use VLM for the database cache
and made memory allocation trade-offs for DLM locks
to reduce processor stalls and improve cache hit ratios.

One common concern is performance scalability of
cluster systems, that is, incremental performance
growth with the size of the cluster. In Figure 8, we
showed the measured performance of an SMP server,
both with and without the Oracle Parallel Server, and
cluster configurations with two, three, and four SMP
servers. We do not see linear scaling because the Oracle
Parallel Server imposes a significant amount of over-
head on each cluster node. This overhead equates to
approximately a 25-percent reduction in tpmC on a
per node basis. However, it is important to note that
due to the time constraints of obtaining audited results
for the product announcement, the testing team was
unable to fully tune the server and saturate the server
CPUs. In future testing, additional performance tuning
is planned to further optimize server performance.

The performance testing of the TruCluster multi-
computer system was time-consuming and expensive.
Thus, answering "what if" questions about the sizing
and tuning of varying cluster configurations under dif-
ferent workloads by measurement alone is costly in
both money and time. To address this
problem, we are developing an analytical performance
cluster model for capacity planning and tuning. 1 " The
model will predict the performance of cluster con-
figurations (ranging from two to eight members)
with varying workloads and system parameters (for
example, memory size, storage size, and CPU power).
We will implement this model in Visual C++ to
develop a capacity planning tool.
Optimization techniques have been used to
deploy very large memory (VLM) database tech- 
nology on Digital's AlphaServer 8400 multi- 
processor system. VLM improves the use of 
hardware and software caches, main memory, 
the I/O subsystems, and the Alpha 21164 micro- 
processor itself, which in turn causes fewer 
processor stalls and provides faster locking. 
Digital's 64-bit AlphaServer 8400 system running 
database software from a leading vendor has 
achieved the highest TPC-C results to date, an 
increased throughput due to increased database 
cache size, and an improved scaling with sym- 
metric multiprocessing systems. 



Digital's AlphaServer 8400 enterprise-class server com-
bines a 2-gigabyte-per-second (GB/s) multiprocessor
bus with the latest Alpha 21164 64-bit microprocessor. 1
Between October and December 1995, an AlphaServer
8400 multiprocessor system running the 64-bit Digital
UNIX operating system achieved unprecedented results
on the Transaction Processing Performance Council's
TPC-C benchmark, surpassing all other single-node
results by a factor of nearly 2. As of September 1996,
only one other computer vendor has come within 20
percent of the AlphaServer 8400 system's TPC-C
results.

A memory size of 2 GB or more, known as very
large memory (VLM), was essential to achieving these
results. Most 32-bit UNIX systems can use 31 bits
for virtual address space, leaving 1 bit to differentiate
between system and user space, which creates diffi-
culties when attempting to address more than 2 GB
of memory (whether virtual or physical).

In contrast, Digital's Alpha microprocessors and the
Digital UNIX operating system have implemented
a 64-bit virtual address space that is four billion times
larger than that of 32-bit systems. Today's Alpha chips are
capable of addressing 43 bits of physical memory. The
AlphaServer 8400 system supports as many as 8 physi-
cal modules, each of which can contain 2 CPUs or
as much as 2 GB of memory.' Using these limits, data-
base applications tend to achieve peak performance
using 8 to 10 CPUs and as much as 8 GB of memory.
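
The address-size figures in the two preceding paragraphs are powers of two and can be verified mechanically. The C sketch below is only an arithmetic illustration with hypothetical function names: 31 address bits give 2 GB, 43 physical address bits give 8,192 GB, and the ratio of a 64-bit to a 32-bit address space is 2^32, or roughly four billion.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes addressable with the given number of address bits (bits < 64). */
uint64_t addressable_bytes(unsigned bits)
{
    assert(bits < 64);
    return (uint64_t)1 << bits;
}

/* Ratio 2^wide / 2^narrow, computed as 2^(wide - narrow) so that a
   64-bit space can be compared without overflowing uint64_t. */
uint64_t space_ratio(unsigned wide_bits, unsigned narrow_bits)
{
    assert(wide_bits >= narrow_bits && wide_bits - narrow_bits < 64);
    return (uint64_t)1 << (wide_bits - narrow_bits);
}
```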

The examples in this paper are drawn primarily from
the optimization of a state-of-the-art database appli-
cation on AlphaServer systems; similar technical con-
siderations apply to any database running in an Alpha
environment. As of September 1996, three of the
foremost database companies have extended their
products to exploit Digital's 64-bit Alpha environ-
ment, namely Oracle Corporation, Sybase, Inc., and
Informix Software, Inc.

The sections that follow describe the TPC-C work-
load and discuss two database optimizations that are
useful regardless of memory size: locking intrinsics
and OM instruction-cache packing. (OM is a post-
link time optimizer available on the Digital UNIX
operating system.) 5 VLM experimental data is then
presented in the section VLM Results.
TPC-C Benchmark

The TPC-C benchmark was designed to mimic com- 
plex on-line transaction processing (OLTP) as speci- 
fied by the Transaction Processing Performance 
Council. 4 The TPC-C workload depicts the activity of 
a generic wholesale supplier company. The company 
consists of a number of distributed sales districts and 
associated warehouses. Each warehouse has 10 districts. 
Each district services 3,000 customer requests. Each 
warehouse maintains a stock of 100,000 items sold by 
the company. The database is scaled according to 
throughput (that is, higher transaction rates use larger 
databases). Customers call the company to place new 
orders or request the status of an existing order. 

Method 

The benchmark consists of five complex transactions 
that access nine different tables. 5 The five transactions 
are weighted as follows: 

1. Forty-three percent: A new-order transaction
places an order (an average of 10 lines) from a ware-
house through a single database transaction and
updates the corresponding stock level for each item.
In 99 percent of the new-order transactions, the
supplying warehouse is the local warehouse and only
1 percent of the accesses are to a remote warehouse.

2. Forty-three percent: A payment transaction
processes a payment for a customer, updates the cus-
tomer's balance, and reflects the payment in the
district and warehouse sales statistics. The customer
resident warehouse is the home warehouse 85 per-
cent of the time and is the remote warehouse
15 percent of the time.

3. Four percent: An order-status transaction returns
the status of a customer order. The customer order is
selected 60 percent of the time by the last name and
40 percent of the time by an identification number.

4. Four percent: A delivery transaction processes
orders corresponding to 10 pending orders for each
district with 10 items per order. The corresponding
entry in the new-order table is also deleted. The
delivery transaction is intended to be executed in
deferred mode through a queuing mechanism.
There is no terminal response for completion.

5. Four percent: A stock-level transaction examines
the quantity of stock for the items ordered by each
of the last 20 orders in a district and determines the
items that have a stock level below a specified
threshold. This is a read-only transaction.

The TPC-C specification requires a response time 
that is less than or equal to 5 seconds for the 90th 
percentile of all but the delivery transaction, which 
must complete within 20 seconds. 
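
To make the response-time requirement concrete, the following C sketch checks a set of measured response times against the limits quoted above. It is a sketch under stated assumptions: the nearest-rank definition of the 90th percentile is chosen here for simplicity and is not taken from the TPC-C specification, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Ascending comparison for qsort over doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank 90th percentile; sorts the samples in place. */
double percentile90(double *samples, size_t n)
{
    size_t rank;
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_double);
    rank = (9 * n + 9) / 10;          /* integer form of ceil(0.9 * n) */
    return samples[rank - 1];
}

/* Limits quoted in the text: 5 seconds in general, 20 for delivery. */
int meets_limit(double *samples, size_t n, int is_delivery)
{
    double limit = is_delivery ? 20.0 : 5.0;
    return percentile90(samples, n) <= limit;
}
```

With ten samples, the ninth smallest value is the one compared with the limit, so a single slow outlier does not by itself cause a failure.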



In addition, the TPC-C specification requires that 
a complete checkpoint of the database be done. A 
checkpoint flushes all transactions committed to the 
database from the database cache (memory) to non- 
volatile storage in less than 30 minutes. This impor- 
tant requirement is one of the more difficult parts 
to tune for systems with VLM.° 

Results 

Table 1 gives the highest single-node TPC-C results 
published by the Transaction Processing Performance 
Council as of September 1, 1996.' 

For a complete TPC-C run, a remote terminal 
emulator must be used to simulate users making trans- 
actions. For performance optimization purposes, how- 
ever, it is convenient to use a back-end-only version 
of the benchmark in which the clients reside on the 
server. The transactions per minute (tpm) derived in 
this environment are called back-end tpm in Table 2 
and cannot be compared to the results of audited runs 
(such as those given in Table 1). However, when a per-
formance improvement is made to the back-end-only
environment, performance improvements are clearly 
seen in the full environment. 

Tuning for the system is iterative. For each data 
point collected, clients were added to try to saturate 
the server; then the amount of memory was varied for 
the database cache. A trade-off between database mem- 
ory, system throughput, and checkpoint performance 
required us to tune each data point individually. The 
system was configured with a sufficient number of 
disk drives and I/O controllers to ensure that it was 
100-percent CPU saturated and never I/O limited. 
The experiments reported in this paper use database 
sizes of approximately 100 GB, spread over 172 RZ29 
spindles and 7 KZPSA adapter/HSZ40 controller pairs, 
with each HSZ40 controller using 5 small computer 
systems interface (SCSI) buses. 

Tuning Specific to Alpha 

UNIX databases on Digital's Alpha systems were first
ported in 1992. For database companies to fully use
the power of Alpha's 64-bit address space, each data-
base vendor had to expand the scope of its normal
32-bit architecture to make use of 64-bit pointers.
Thus, each database could then address more than
2 GB of physical memory without awkward code seg-
ments or other manipulations to the operating system
to extend physical address space.

By 1994, most vendors of large databases were offer-
ing 64-bit versions of their databases for Digital's Alpha
environment. As a group chartered to measure database
performance on Alpha systems, Digital's Computer
Systems Division (CSD) Performance Group worked
with each database vendor and with the Digital System
Performance Expertise Center to improve performance.
Table 1

TPC-C Results

System                                                                      Throughput      Price/Performance  Number of CPUs  Date
AlphaServer 8400 5/350, Oracle Rdb7 V7.0, OpenVMS V7.0                      14,227 tpmC     $269/tpmC          10              May 1996
AlphaServer 8400 5/350, Sybase SQL Server 11.0, Digital UNIX, iTi Tuxedo    14,176 tpmC     $198/tpmC          10              May 1996
AlphaServer 8400 5/350, Informix V7.21, Digital UNIX, iTi Tuxedo            13,646.17 tpmC  $277/tpmC          10              March 1996
Sun Ultra Enterprise 5000, Sybase SQL Server V11.0.2                        11,465.93 tpmC  $191/tpmC          12              April 1996
AlphaServer 8400 5/350, Oracle7, Digital UNIX, iTi Tuxedo                   11,456.13 tpmC  $286/tpmC          8               December 1995
AlphaServer 8400 5/300, Sybase SQL Server 11.0, Digital UNIX, iTi Tuxedo    11,014.10 tpmC  $222/tpmC          10              December 1995
AlphaServer 8400 5/300, Oracle7, Digital UNIX, iTi Tuxedo                   9,414.06 tpmC   $316/tpmC          8               October 1995
SGI CHALLENGE XL Server, INFORMIX-OnLine V7.1, IRIX, IMC Tuxedo             6,313.78 tpmC   $479/tpmC          16              November 1995
HP 9000 Corporate Business Server, Sybase SQL Server 11, HP-UX, IMC Tuxedo  5,621.00 tpmC   $380/tpmC          12              May 1995
HP 9000 Corporate Business Server, Oracle7, HP-UX, IMC Tuxedo               5,369.68 tpmC   $535/tpmC          12              May 1995
Sun SPARCcenter 2000E, Oracle7, Solaris, Tuxedo                             5,124.21 tpmC   $323/tpmC          16              April 1996
Sun SPARCcenter 2000E, INFORMIX-OnLine 7.1, Solaris, Tuxedo                 3,534.20 tpmC   $495/tpmC          20              July 1995
IBM RS/6000 PowerPC R30, DB2 for AIX, AIX, IMC Tuxedo                       3,119.16 tpmC   $355/tpmC          8               June 1995
IBM RS/6000 PowerPC J30, DB2 for AIX, AIX, IMC Tuxedo                       3,119.16 tpmC   $349/tpmC          8               June 1995



Table 2

Amount of Memory versus Back-end tpm,
Database-cache Miss Rate, and Instructions per Transaction

Database    Back-end       Relative             Relative
Memory      (Normalized    Database-cache       Instructions per
(GB)        tpm)           Miss (Percentage)    Transaction
1           1.0            1.0                  1.0
2           1.3            0.73                 0.75
3           1.5            0.58                 0.63
4           1.6            0.50                 0.57
5           1.7            0.42                 0.50
6           1.8            0.40                 0.45



Two optimizations generally realized 20 percent gains
on Alpha systems/ These were

1. Optimization of spinlock primitives supported now
by DEC C compiler intrinsics

2. OM profile-based link optimization, which per-
forms instruction-cache packing during the final
link of the database



In addition, the Digital UNIX operating system 
version 3.2 and higher versions have optimized I/O 
code paths and support advanced processor affinity 
and other scheduling algorithms that have been opti- 
mized for enterprise-class commercial performance. 
With these optimizations, database performance on 
Digital's Alpha systems has been significantly improved. 
Lock Optimization

Locks are used on multiprocessor systems to synchro-
nize atomic access to shared data. A lock is either
unowned (clear) or owned (set). A key design decision
leading to good multiprocessor performance and scal-
ing is partitioning the shared data and associated locks.
The discussion of how to partition data and associated
locks to minimize contention and the number of locks
required is beyond the scope of this paper.

The implementation of locks requires an atomic
test-and-set operation. On a particular system, the
implementation of the lock is dependent on the primi-
tive test-and-set capabilities provided by the hardware.

Locks are used to synchronize atomic access to
shared data. A shared data element that requires
atomic access is associated with a lock that must be
acquired and held while the data is modified. A
sequence of code that accesses shared data protected
by a lock is called a critical section. A critical section
begins with the acquisition of a lock and ends with the
release of that lock. Although it is possible to have
nested critical sections where multiple locks are
acquired and released, the discussion in this section is
limited to a critical section with a single lock.

To provide atomic access to shared data, the critical 
section running on a given processor locks the data by 
acquiring the lock associated with the shared data. In 
the simplest case, if a second processor tries to acquire 
access to shared data that is already locked, the second 
processor loops and continually retries the access 
(spins) until the processor owning the lock releases it. 
In a complex case, if a second processor tries to acquire 
access to shared data that is already locked, the second 
processor loops a few times and then, if the lock is still 
owned by another processor, puts itself into a wait
state until the processor owning the lock releases it. 

The Alpha Architecture Reference Manual specifies
that ". . .the order of reads and writes done in an Alpha
implementation may differ from that specified by the
programmer. " s Therefore, process coordination
requires a special test-and-set operation that is imple-
mented through the load-locked/store-conditional
instruction sequence. To provide good performance
and scaling on multiprocessor Alpha systems, it is
important to optimize the test-and-set operation to
minimize latency. The test-and-set operation can be
optimized by the following methods:

■ Use an in-lined load-locked/store-conditional 
sequence through an embedded assembler or com- 
piler intrinsics. 

■ Preload a lock using a simple load operation prior
to a load-locked operation.

■ If a lock is held, spin on a simple load instruction
rather than a load-locked instruction sequence.

The basic hardware building block used to imple-
ment the acquisition of a lock is the test-and-set
operation. On many microprocessors, an atomic test-
and-set operation is provided as a single instruction.
On an Alpha microprocessor, the test-and-set opera-
tion needs to be built out of load-locked (LDx_L) and
store-conditional (STx_C) instructions. The LDx_L
... STx_C instructions allow the Alpha microprocessor
to provide a multiprocessor-safe method to implement
the test-and-set operation with minimal restrictions on
read and write ordering. The load-locked operation
sets a locked flag on the cache block containing the
data item. The store-conditional operation ensures
that no other processor has modified the cache block
before it stores the data. If no other processor has
modified the cache block, the store-conditional opera-
tion is successful and the data is written to memory. If
another processor has modified the cache block, the
store-conditional operation fails, and the data is not
written to memory. Optimizing the test-and-set
sequence on Alpha systems is a complex task that pro-
vides significant performance gains.
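
For readers without access to an Alpha toolchain, the same ideas can be sketched in portable C11. This is an analogue written for illustration, not the DEC C intrinsic code supplied to database vendors: the type and function names are hypothetical, and the atomic exchange stands in for the load-locked/store-conditional sequence, which a compiler targeting Alpha would be expected to generate for it. The sketch preloads the lock with a plain load, spins on plain loads while the lock is held, and relies on acquire ordering in place of an explicit memory barrier after a successful acquisition.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_int held;   /* 0 = clear (unowned), 1 = set (owned) */
} spinlock;

void spinlock_init(spinlock *l)
{
    atomic_init(&l->held, 0);
}

/* One test-and-set attempt: returns 1 if this caller took the lock. */
int spinlock_try(spinlock *l)
{
    /* Preload with a plain read and skip the atomic operation
       entirely when the lock is visibly held. */
    if (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
        return 0;
    /* The exchange returns the old value; 0 means we set the lock. */
    return atomic_exchange_explicit(&l->held, 1,
                                    memory_order_acquire) == 0;
}

/* Simple case from the text: spin until the lock is acquired. */
void spinlock_acquire(spinlock *l)
{
    while (!spinlock_try(l)) {
        /* Spin on plain loads until the lock looks clear. */
        while (atomic_load_explicit(&l->held,
                                    memory_order_relaxed) != 0)
            ;
    }
}

void spinlock_release(spinlock *l)
{
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Spinning on a plain load keeps the waiting processor reading its own cached copy of the lock word, so it attempts the more expensive atomic sequence only when the lock appears free.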

Figure 1 shows code sequences that Digital's CSD 
Performance Group has given to database vendors to 
improve locking intrinsics in the Alpha environment. 
These code sequences can be used to implement spin-
locks in the DEC C compiler on the Digital UNIX
operating system.

Using OM Feedback 

As previously mentioned, OM is a post-link time opti-
mizer available on the Digital UNIX operating system.
It performs optimizations such as compression of
addressing instructions and dead code elimination
through the use of feedback. The performance
improvement provided by OM on Alpha 21164
systems is dramatic for the following two reasons. 1

■ The 21164 microprocessor has an 8-kilobyte (KB)
direct-mapped instruction cache, which makes
code placement extremely important. In a direct-
mapped cache, the code layout and linking order
map one for one to its placement in cache. Thus
a poorly chosen instruction stream layout or sim-
ply unlucky code placement within libraries can
alter performance by 10 to 20 percent. Routines
are frequently page aligned, which can increase
the likelihood of cache collisions.

■ The high clock rate of the Alpha 21164 micro-
processor (300 to 500 megahertz [MHz])
requires a cache hierarchy to attempt to keep the
CPU pipelines filled. The penalty of a first-level
cache miss is 5 to 9 cycles, which means that an
// TEST_AND_SET implements the Alpha version of a test and set operation using
// the load-locked .. store-conditional instructions. The purpose of this
// function is to check the value pointed to by spinlock_address and, if the
// value is 0, set it to 1 and return success (1) in R0. If either the spinlock
// value is already 1 or the store-conditional failed, a failure status
// (0, 2, or 3) is returned in R0.
//
// The status returned in R0 is:
//   0 - failure (spinlock was clear; still clear, store-conditional failed)
//   1 - success (spinlock was clear; now set)
//   2 - failure (spinlock was set; still set, store-conditional failed)
//   3 - failure (spinlock was set; still set)
//
#define TEST_AND_SET(spinlock_address) \
    asm("ldl_l $0,($16); " \
        "or $0,1,$1; " \
        "stl_c $1,($16); " \
        "sll $0,1,$0; " \
        "or $0,$1,$0 ", \
        (spinlock_address));

// BASIC_SPINLOCK_ACQUIRE implements the simple case of acquiring a spinlock. If
// the spinlock is already owned or the store-conditional fails, this function
// spins until the spinlock is acquired. This function doesn't return until the
// spinlock is acquired.
//
#define BASIC_SPINLOCK_ACQUIRE(spinlock_address) \
    status = 0; \
    \
    while (1) \
    { \
        if (*(spinlock_address) == 0) \
        { \
            status = TEST_AND_SET(spinlock_address); \
            if (status == 1) \
            { \
                MB; \
                break; \
            } \
        } \
    }

instruction-cache miss rate of 10 to 12 percent can
effectively stall the CPU 70 to 80 percent of the
time. Conversely, decreasing the miss rate by
2 percent can increase throughput by 10 percent.
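
The stall estimate in the second reason can be approximated with a simple additive model. The C sketch below is a back-of-the-envelope illustration, not the authors' method: it assumes that the miss rate is counted per instruction, that each miss costs a fixed number of cycles, and that the processor would otherwise sustain an ideal 0.25 cycles per instruction (four instructions issued per cycle). The function name is hypothetical.

```c
#include <assert.h>

/* Fraction of cycles spent stalled on instruction-cache misses.
   miss_rate: misses per instruction; penalty: cycles per miss;
   ideal_cpi: cycles per instruction with no misses (an assumption). */
double stall_fraction(double miss_rate, double penalty, double ideal_cpi)
{
    double stall_cpi;
    assert(miss_rate >= 0.0 && penalty >= 0.0 && ideal_cpi > 0.0);
    stall_cpi = miss_rate * penalty;   /* stall cycles per instruction */
    return stall_cpi / (ideal_cpi + stall_cpi);
}
```

Under these assumptions, a 10 percent miss rate with a 5-cycle penalty gives a stall fraction of about 67 percent, and a 12 percent miss rate with a 9-cycle penalty gives about 81 percent, which is broadly in line with the 70 to 80 percent range quoted above.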

OM performs profile-based optimization. A pro-
gram is first partitioned into basic blocks (that is,
regions containing only one entrance and one exit),
and instrumentation code is added to count the num-
ber of times each block is executed. The instrumented
version of the program is run to create a feedback file
that contains a profile of basic block counts. OM then
uses the feedback to rearrange the blocks in an optimal
way for the first-level caches on the Alpha chip. The
details of the procedure for using OM may be found
in the manpage for cc on the Digital UNIX operating
system but can be summarized as follows:



■ Build executable with -non_shared -om options,
producing prog.

■ Use pixie to produce prog.pixie (the instrumented
executable) and prog.Addrs (addresses).

■ Run prog.pixie to produce prog.Counts, which
records the basic block counts.

■ Now build prog again with -non_shared -om
-WL,-om_ireorg_feedback.

VLM Results 

Figure 2 shows the increase in throughput realized
when using VLM. Note that throughput nearly dou-
bles as the amount of memory allocated to the data-
base cache is varied from 1 GB to 6 GB. Of course, the
overall system requires additional memory beyond
the database cache to run UNIX itself and other
processes. For example, an 8-GB system allows 6.6 GB
to be used for the database cache.

Figure 2

Database Cache Size versus Throughput

Performance Analysis 

Why does the use of VLM improve performance by a
factor of nearly 2? Using statistics within the database,
we measured the database-cache hit ratio as memory
was added. Figure 3 shows the direct correlation
between more memory and decreased database-cache
misses: as memory is added, the database-cache miss
rate declines from 12 percent to 5 percent. This raises
two more questions: (1) Why does the database-cache
miss rate remain at 5 percent? and (2) Why does a
small change in database-cache miss rates improve the
throughput so greatly?

The answer to the first question is that with a data- 
base size of more than 100 GB, it is not possible to 
cache the entire database. The cache improves the 
transactions that are read-intensive, but it does not 
entirely eliminate I/O contention. 



Figure 3

Cache Miss Rates and Bus Utilization

(Plot of bus utilization, B-cache miss rate, I-cache miss rate, and
database cache miss rate against memory in GB, from 1 to 6.)



To answer the second question, we need to look at 
the AlphaServer 8400 system's hardware counters that 
measure instruction-cache (I-cache) miss rate, board- 
cache (B-cache) miss rate, and the bandwidth used on 
the multiprocessor bus. With an increase in throughput 
and memory size, the VLM system is spanning a larger 
data space, and the bus utilization increases from 24 
percent to 32 percent. Intuitively, one might think this 
would result in less optimal instruction- and data-stream
locality, thus increasing both miss rates. As shown in 
Figure 3, this proved true for instruction stream misses 
(I-cache miss rate) but not true for the data stream, as 
represented by the B-cache miss rate. The instruction 
stream rarely results in B-cache misses, so B-cache 
misses can be attributed primarily to the data stream. 

Performance analysis requires careful examination 
of the throughput of the system under test. The appar- 
ent paradox just related can be resolved if we normal- 
ize the statistics to the throughput achieved. Figure 4 
shows that the instruction-cache misses per transaction 
declined slightly as the memory size was increased from 
1 GB to 6 GB — and as transaction throughput doubled. 
Furthermore, the B-cache works substantially better 
with more memory: misses declined by 2X on a per- 
transaction basis. Why is this so? 

Analysis of the system monitor data for each run
indicates that bringing the data into memory helped
reduce the I/O per second by 30 percent. If the trans-
action is forced to wait for I/O operations, it is done
asynchronously, and the database causes some other
thread to begin executing. Without VLM, 12 percent
of transactions miss the database cache and thus stall
for I/O activity. With VLM, only 5 percent of the
transactions miss the database cache, and the time to
perform each transaction is greatly reduced. Thus each
thread or process has a shorter transaction latency. The
shorter latency contributes to a 15-percent reduction
in system context switch rates. We attribute the
measured improvement in hardware miss rates per
transaction when using VLM to the improvement in
context switching.

The performance counters on the Alpha micro- 
processor were used to collect the number of instruc- 
tions issued and the number of cycles. v In Table 2, 
the relative instructions per transaction results are the 
ratios of instructions issued per second divided by the 
number of new-order transactions. (In TPC-C, each 
transaction has a different code path and instruction 
count; therefore the instructions per transaction 
amount is not the total number of new-order trans- 
actions.) The relative difference between instructions 
per transaction for 1 GB of database memory versus 
6 GB of database memory is the measured effect of 
eliminating 30 percent of the I/O operations, satisfy- 
ing more transactions from main memory, reducing 
context switches, and reducing lock contention. 
Improved CPU Scaling: More Efficient Locking

A final benefit of using VLM is improved symmetric
multiprocessing (SMP) scaling. Because the TPC-C
workload has several transactions with high read con-
tent, having the data available in memory, rather than
on disk, allows an SMP system to perform more effi-
ciently. More requests can be serviced that are closer in
cycles to the CPU. Data found in memory is less than
a microsecond away, whereas data found on disk is
on the order of milliseconds away.

We have shown how this situation improves the
overall system throughput. In addition, it improves
SMP scaling. Figure 5 shows the relative scaling
between 2 CPUs and 8 CPUs with only 2 GB of system
memory (1.5 GB of database cache) compared to the
same configurations having 8 GB of system memory
(6.6 GB of database cache).

We used the performance counters on the Alpha
21164 microprocessor to monitor the number of
cycles spent on the memory barrier instruction.''
Memory barriers are required for implementing
mutual exclusion in the Alpha processor. They are used
by all locking primitives in the database and the operat-
ing system. With VLM at 8 GB of memory, we mea-
sured a 20-percent decline in time spent in the memory
barrier instruction. Larger memory implied less con-
tention for critical disk and I/O channel resources and
thus less time in the memory barrier instruction.

Conclusions 

Open system database vendors are expanding into
mainframe markets as open systems acquire greater
processing power, larger I/O subsystems, and the
ability to deliver higher throughput at reasonable
response times. To this end, Digital's AlphaServer
8400 5/350 system using VLM database technology
has demonstrated substantial gains in commercial
performance when compared to systems without the
capability to use VLM. The use of up to 8 GB of mem-
ory helps increase system throughput by a factor of 2,
even for databases that span 50 GB to 100 GB in size.

The Digital AlphaServer 8400 5/350 system, com-
bined with the Digital UNIX operating system's ability to
address more than 2 GB of memory, has made possi-
ble improved TPC-C results from several vendors. In
this paper, we have shown how VLM

■ Increased the throughput by a factor of nearly 2

■ Increased the database-cache hit ratios from 88 per-
cent to 95 percent

By using monitor tools designed for the Alpha plat-
form, we have measured the effect of VLM in issuing
fewer instructions per transaction on the Alpha 21164
microprocessor. When transactions are satisfied by
data that is already in memory, the CPU has fewer
hardware cache misses, fewer memory barrier proces-
sor stalls, faster locking, and better SMP scaling.

Future Digital AlphaServer systems that will be 
capable of using more physical memory will be able to 
further exploit VLM database technology. The results 
of industry-standard benchmarks such as TPC-C, 
which force problem sizes to grow with increased
throughput, will continue to demonstrate the realistic 
value of state-of-the-art computer architectures. 
Building Collaboration
Software for the Internet



Collaboration software for the Internet's World 
Wide Web involves the development of shared 
information systems for network computing. 
The AltaVista Forum version 2.0 software from 
Digital contains extensions to World Wide Web 
technology that facilitate collaboration on the 
Internet. The extensions consist of a toolkit 
and a set of collaboration applications. The 
toolkit components include a built-in data- 
base with an indexing and search capability. 
Generic applications include discussion, docu- 
ment sharing, and calendar applications and 
administrative functions for managing users, 
teams, and access control. 



The Internet and the World Wide Web (WWW) 
have changed the scope of network computing. As 
the Internet user population has grown, so has the 
demand for better ways to collaborate on the Internet. 
Some examples include the ability to share and discuss 
issues of common interest, coauthor documents, and 
track project status. Although today's WWW is ideal 
for publishing information, it requires considerable 
customized programming to support collaboration. 
The AltaVista Forum version 2.0 product is both a set 
of collaborative applications and a toolkit (platform) 
that facilitates easy, efficient, and rapid development of
collaborative applications for the Internet for both 
UNIX and Windows NT systems. 

In this paper, we describe our experiences in build- 
ing collaboration software for the Internet. We begin 
with a brief discussion of WWW technology and 
groupware applications. Then we present our design 
philosophy and the framework of the software and dis-
cuss the applications supplied by AltaVista Forum.
Following that, we discuss the various experiences 
gained in developing software for the new Internet 
paradigm. We conclude the paper by discussing our 
plans for future development efforts. 

World Wide Web Technology 

Today's Internet was originally a government-funded
computer network that facilitated collaboration 
among academic researchers. Information exchange 
was conducted by means of electronic mail (e-mail) 
and file transfer. Over time, bulletin-board style
discussions were supported by the Network News
Transfer Protocol (NNTP), which propagated textual 
discussion threads to a large number of NNTP servers 
for viewing. With the development of the WWW tech- 
nology, collaborating over the Internet has become 
even easier. 

The WWW technology consists of the following 
elements: 

■ Universal resource locator (URL), a convention for 
information naming and linking

■ Hypertext markup language (HTML), a text-based 
language for information rendering 
■ Hypertext Transfer Protocol (HTTP), a simple
client-server protocol to transport information
associated with a URL

■ Web browser, a program that renders HTML docu- 
ments, provides URL caching, and supports a 
directory for URLs 

■ Web server, a server that responds to requests for 
information from the Web browsers 

Information Access 

WWW technology has transformed the way users 
access information through computer networks. 
Access to information on the Internet was primarily 
text-based; with the WWW, users are able to access 
information in multimedia format. The combination 
of functionality (information linking, graphical inter- 
face, and caching), extensibility (for dealing with new 
protocols and new information types), ease-of-use,
and low cost appealed to a wide range of users in 
homes, offices, and corporations. In addition, the 
Mosaic-style of "point-and-click" graphical Internet 
browser has become the most widely accepted user 
interface for network computing. 

The most popular use of the WWW today is for pub- 
lishing information, and the process is comparable to 
the way a newspaper publishes or a television station
broadcasts information. The roles of the information 
provider and the information consumer are clearly 
defined. The information provider gathers and orga- 
nizes the pertinent information, converts it to the 
HTML markup format, and makes it available on a
Web server. The information consumer, after obtain- 
ing initial access to the Web server (as one might tune 
into the correct television station), can then browse 
and search for various types of information available
on that server. The linking capability of URL and 
HTML allows the references or links to additional 
information on various servers to be easily published 
along with the original information. 

In contrast, multiple information providers work 
in collaboration to generate the content of shared 
information. For the purposes of this paper, we will 
assume that there is only one type of user — informa- 
tion collaborators. 

Collaboration and Groupware 

The WWW is useful for many types of collaboration. 
For example, a project team may need to keep track of 
project status and individual progress; people with 
a common interest (e.g., film enthusiasts) may want to 
share and discuss their views on that topic; a customer 
support group may need a system to provide on-line
answers to real-world customer problems; or several
authors may wish to work on a document together.



Today, several computer applications facilitate such 
collaboration. Collectively, these applications are
known as groupware. Lotus Notes is a popular group-
ware application. Typically, groupware applications
support the following capabilities: 

■ Management of a set of users and groups 

■ Storage of shared information in a database (some- 
times with replication capability) 

■ Viewing information stored in the databases by 
means of a graphical interface 

■ Protection of the collaboration environment when 
necessary through authentication and access control 

Groupware systems are built to run in homoge- 
neous client environments, such as the Microsoft 
Windows environment. They rely on specific client- 
server technology, which is often proprietary, to sup- 
port remote operations. 

The popularity and rapid growth of the Internet and
the WWW have created an open, universal, and easy-to-
program infrastructure that can readily serve several
groupware functions. Engineers at Digital's Internet 
Software Business Group recognized the potential of 
using the WWW as the underlying infrastructure for 
groupware solutions and at the same time saw that the 
groupware applications available today have features 
that the WWW lacks. Our goal was to add groupware 
features to the WWW to facilitate collaboration. 

We started exploring the idea of using the Internet 
and the WWW for groupware applications in the sum-
mer of 1994. By the end of that year, we had built
a prototype that supported the simplified discussion
(bulletin-board) features of an internal product known
as DEC Notes.1 This prototype generated considerable
interest among active DEC Notes users who were
seeking a similar solution built around an Internet
infrastructure. Based on their feedback, the prototype
was redesigned and became a product.2

By September 1995, we had built several collabora-
tive applications to run over the WWW. In a workshop
organized by the World Wide Web Consortium and 
the Massachusetts Institute of Technology, we par- 
ticipated in discussions on how to extend the WWW 
technology to support collaboration. All the work- 
shop participants presented their ideas to the WWW 
Consortium for review.3

Design 

In this section, we summarize our design philosophy 
and discuss the framework and applications developed
for the AltaVista Forum product. For our design, we 
adopted an object-oriented approach, which meant 
that we would have to modularize the various compo- 
nents for reuse and modification. 
Design Philosophy

Our fundamental design philosophy required using
the Internet and its infrastructure as building blocks
for our collaboration software. After years of experi-
menting and collaborating to develop an open
process, the Internet developers realized that the 
Internet had reached a state of critical mass. In the case 
of networks and connectivity, reaching critical mass is 
a tremendous impetus for agreeing on a common 
standard. As more and more users access the Internet, 
the need for software development for the Internet 
also increases. In addition, the very nature of the 
Internet demands an open standardization process to 
ensure the long-term viability of a product. 

Our philosophy also included the reuse of existing 
open software as building blocks whenever possible. 
In addition to our choice of building upon the 
Internet and the WWW technology, we selected 
the Tool Command Language (Tcl) as the primary
language for developing most of our application and
user interface functions.4 We also took advantage of
the database library in the Berkeley UNIX distribution
for built-in database support.5

Another objective was to make sure our software 
would be easy to port to all the relevant operating 
system platforms. This principle guided our selection 
of components and helped us isolate a small set of plat- 
form-dependent functions into a special library for 
porting the software. 

As stated earlier, we tried to take an object-oriented 
approach whenever possible. The advantages of our
approach became increasingly apparent as more peo- 
ple became involved with the software development. 
The object-oriented approach made component reuse 
feasible. 

Framework 

Our framework organizes the AltaVista Forum soft- 
ware into two layers: toolkit and applications. The tools 
required to build the applications overlap each other. 
We have used them to build generic applications, 
including a discussion application that supports users 
discussing a set of related topics, much like newsgroups 
do; a calendar application that supports users' abilities
to schedule events on a specific date and at a particular 



time; and a newspaper application that provides a per-
sonalized news filtering service. We envision that, over
time, the framework we have developed will support
a number of diverse applications. Figure 1 shows the 
AltaVista Forum toolkit and application layers. 

The toolkit is a combination of both C and Tcl code
that creates the following interface components: 

■ Built-in database. The application uses a built-in 
database to store its object instances. The database 
is a very simple relational model with an object 
hierarchy relationship facility available to those 
applications that need it. The library also provides 
inversions on certain attributes to support fast 
retrieval and sorting based on attribute values. 

■ Built-in indexing and search. An indexing and 
search function complements the database by 
providing a high-speed query facility. For less- 
structured objects, it is often easier to index them 
and look them up using a search tool. 

■ Graphical user interface support. The use of a graph- 
ical user interface insulates applications from hav- 
ing to deal with HTML directly and cope with its 
changes over time. Abstract definitions of user inter- 
face objects also tend to simplify and clarify the code 
and create a more uniform appearance on the screen. 

■ Access control. All applications require some form 
of access control to regulate who can access, create, 
modify, and delete various objects. 

■ Internationalization. An internationalization facil- 
ity gathers strings that appear in the user interface 
into message catalogs for later translation to differ- 
ent languages. 

■ Platform-specific support. A special library isolates 
those operating system-dependent functions that
vary from platform to platform. Certain file system 
accesses and date/time library accesses are exam- 
ples of this component. 

Armed with all the components in the toolkit, an 
AltaVista Forum application consists of a set of func- 
tions, each responding to a different user request. The 
organization of an application is modular. A function 
can call various objects that are defined separately as 
part of the application, including the following:
■ Graphical objects such as definitions of buttons,
toolbars, various objects that are part of a form
(e.g., select boxes, radio buttons, check boxes, text 
boxes), and icons. 

■ Database entries, the definitions of their attributes, 
and default values. 

■ User interface aggregate objects such as forms, 
views, dialogs, and error messages. 

■ Default access control policies, including default
groups, access rights, and their mappings, to con-
trol who can access individual forums and what
actions they can take within them.

This approach encapsulates the details in low-level 
modules, making the software more readable and 
maintainable. It also makes it easy for different func- 
tions to reuse the objects. 

To further facilitate code sharing, the framework 
also allows applications to inherit a set of functions 
and objects that have been grouped together as a 
pseudoapplication. For example, the access control 
management functions can be grouped into a 
pseudoapplication and certain button and toolbar 
definitions can be grouped into another pseudo- 
application. All applications that need access control 
and the common graphical objects that lend a consis- 
tent "look-and-feel" can inherit those functions and 
objects from pseudoapplications. 
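
The inheritance of functions from pseudoapplications can be pictured with a minimal sketch. It is written in Python for illustration only (the toolkit itself is C and Tcl), and the class name Application, the inherits list, the message names, and the lookup order (own functions first, then each inherited pseudoapplication in turn) are all assumptions rather than the product's actual interface.

```python
class Application:
    """A named bundle of functions that may inherit from pseudoapplications."""

    def __init__(self, name, functions=None, inherits=()):
        self.name = name
        self.functions = dict(functions or {})   # message name -> handler
        self.inherits = list(inherits)           # pseudoapplications to fall back on

    def lookup(self, message):
        """Return the handler for a message, searching own functions first."""
        if message in self.functions:
            return self.functions[message]
        for parent in self.inherits:
            handler = parent.lookup(message)
            if handler is not None:
                return handler
        return None


# Hypothetical pseudoapplications that group shared functions and objects.
access_control = Application("acl", {"edit_acl": lambda: "access control form"})
common_ui = Application("ui", {"toolbar": lambda: "standard toolbar"})

# A concrete application inherits both and adds a function of its own.
discussion = Application(
    "discussion",
    {"read_topic": lambda: "topic text"},
    inherits=[access_control, common_ui],
)
```

In this sketch the discussion application never defines a toolbar itself; it picks up the shared definition, which is how a consistent look-and-feel could be obtained across applications.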

The AltaVista Forum product works in conjunction 
with the Web browser and the Web server. The Web
browser submits requests to the Web server whenever 
the user opens a link. If the link points to a file, then 
the Web server sends the file to the browser, which is 
the normal interaction. The link can also point to pro- 
grams on the server; in this case, the Web server 
invokes the program and then the program responds 
to the user. 

When the link points to the AltaVista Forum, the 
Web server invokes the AltaVista Forum dispatcher 
program through the common gateway interface 
(CGI). Based on the information passed along with
the user request, the dispatcher invokes a specific
application, which, in turn, calls various tools in the
toolkit to respond to the user's request. Figure 2 illus-
trates the interaction of the AltaVista Forum software
with the Web browser and server. 



Parameters are passed to the dispatcher from seg- 
ments of the URL. The dispatcher parses the URL into 
the pieces that provide the overall control of the pro-
gram: (1) the forum name, (2) the access control area
name, (3) the message name, and (4) the message 
arguments. 
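
As a rough illustration of this parsing step, the following Python sketch splits a request URL into the four pieces. The paper does not give the actual URL layout, so the path prefix /cgi-bin/forum, the segment order (area, forum, message), and the use of the query string for message arguments are assumptions made only for this example.

```python
from typing import NamedTuple
from urllib.parse import urlsplit, parse_qs


class Request(NamedTuple):
    area: str        # access control area name
    forum: str       # forum name
    message: str     # message name (the action to perform)
    arguments: dict  # message arguments


def parse_request(url: str, prefix: str = "/cgi-bin/forum") -> Request:
    """Split a URL into the pieces a dispatcher would act on (assumed layout)."""
    parts = urlsplit(url)
    if not parts.path.startswith(prefix + "/"):
        raise ValueError("not a forum URL")
    segments = parts.path[len(prefix) + 1:].split("/")
    if len(segments) != 3 or not all(segments):
        raise ValueError("expected area/forum/message")
    area, forum, message = segments
    # Keep only the first value of each query parameter for simplicity.
    arguments = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return Request(area, forum, message, arguments)
```

A dispatcher built along these lines would use the area to locate the user database, the forum to select the application instance, and the message and arguments to choose the action.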

Each forum is an instance of an application object. 
For example, many discussion forums are available on 
various topics. Each discussion forum has its own name 
at the time of creation; however, the same discussion 
application can be used to manage all the forums. 

An access control area contains a set of forums and 
a common user/group database. An administrator 
group helps administer the user/group database and 
establish overall access control policies for the environ-
ment. A user registers only once with an access control
area. Based on the access control area location, the
hypertext server not only knows where to find the
user's credentials for authentication purposes but also
knows how to authenticate the user and pass the
authenticated user identity to the AltaVista Forum 
environment. Given the user identity and the access 
control location, AltaVista Forum software can also 
look up the user profile, check access control, and per- 
form other user-specific functions. 

The message name and message arguments 
then select particular actions to perform within the 
application. 

Generic Applications 

The AltaVista Forum product supplies a set of generic 
applications that make the software immediately
usable. The applications are described in this section. 

User and Group Management and Lookup This appli- 
cation provides an interface for user registration 
(either by the user or by an administrator). Users can 
supply and modify their business card information 
such as phone numbers and e-mail addresses. Users 
can also set certain preference parameters that help 
the AltaVista Forum software tailor its responses 
(e.g., native language and preferred display formats).
In addition, groups can be created and modified as a 
set of users. This application also provides the interface 
for listing and searching for user and group informa-
tion for all forums. As discussed earlier, the AltaVista
Forum product can support all the users that a Web
server can handle since only one repository of users 
and groups is necessary. 

Community, Team, and Personal Vistas A vista is 
another term for home page, which is a place for the 
user to log in to the WWW. Once in the community 
vista, the user sees a set of public forums and links to 
perform various tasks, e.g., register oneself, look up 
teams or join a team, perform AltaVista Forum admin- 
istrative tasks (if an administrator), and so on. For this 
reason, the community vista is also called the summit. 
In much the same way, a team vista keeps track of all
the forums and links for a group of users, and a per-
sonal vista performs this function for a single user.
Both team and personal vistas can own forums that are 
not visible to the public community vista. 

Discussion Much like a bulletin-board discussion 
group or Digital's DEC Notes software, this appli- 
cation permits users to share ideas on a set of related 
topics. Users create topics and replies that form a hier- 
archical tree (also known as threaded topics), providing 
a way for users to navigate through existing discussions.
Other methods of reading the existing discussion are 
also provided. These include chronologically navigat- 
ing through items not read; listing unread items only 
and selectively reading them; and searching for topics 
and replies containing certain words that were entered 
during a particular time period by a certain author. 
Users can also create multiple discussion forums to 
discuss different topics; this is true for the following 
applications as well. 

Document Sharing The document sharing applica- 
tion enables users to organize documents of the same 
type into hierarchically organized folders. In addition, 
it keeps track of versions of the documents, attach- 
ments, and comments. As with the discussion applica- 
tion, users can browse through and search for specific 
documents using a variety of methods. 

Newspaper The newspaper application lets users 
select a specific source of information and then define 
filters to present only those items of potential interest.
A good example of an information source on the 
Internet is one of the real-time news feeds. Using the 
newspaper application, it is also possible to read and 
monitor other information sources, e.g., e-mail sent to 
a distribution list or information appearing on a set of 
WWW sites. 

Calendar The calendar application permits users 
to enter a set of scheduled events (or a to-do list) and 
present the events as a calendar (sometimes called 
a diary). The application supports requests to add
items to the calendar, thus allowing the calendar to be 



used as a scheduling tool. Although a calendar forum 
can be set up for each person, it is equally useful
to have a team calendar, a community calendar, or 
even a calendar for a specific type of event. 

Experiences 

In this section, we summarize some of our experiences 
and discuss the lessons we learned along the way. As 
a result of our decision to rely on the Web browser 
as the universal user interface, we had to resolve some 
unique user interface issues. Because we chose to use 
Tcl for developing higher-level objects, we had to cope
with using an interpretive language. We designed the 
database and indexing and search interfaces based 
on extensibility and portability goals. Finally, in the 
design of access control, we had to carefully weigh 
the pros and cons of simplicity and flexibility. 

Coping with the User Interface Defined in HTML 

Very early in the design phase, we decided to make 
AltaVista Forum client-independent, with the excep- 
tion of dependence on the Web browser. This decision 
was based on the fact that the Web browser was already
freely available on most of the platforms. We expected 
the browser to become a ubiquitous network front 
end, allowing us to focus on building groupware func- 
tions on the server. This meant that we were faced with 
the task of designing the user interface using HTML.6
Since HTML was evolving, our first step was to 
define graphical objects in more abstract constructs 
supported by our toolkit. Each construct encapsulates 
the specifics into a representation of a graphical artifact 
in HTML in the toolkit. Thus as HTML evolves, or as 
the page design changes, only one area needs to be
updated. For example, a select box object on a form 
may be defined as follows: 

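# Define a select box tied to the "language" database attribute
forum selectbox s.language \
    -mapto language \
    -label "Select a language:" \
    -labelbreak

# Add the options; English is preselected
s.language add_option English 1 selected
s.language add_option French 2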

In this example, a select box is defined to begin with 
a label and some spacing and then to contain two 
options: English and French, with English as the
default. The values 1 and 2 are internal representations
of the selected values. Also, the "-mapto" switch spec-
ifies that this object must correspond to the language
attribute in the database, a feature that was included to 
simplify database update. 

Note that although a label is specified, no specifica- 
tion is provided to represent that label in a particular 
font or typeface. Neither is the actual spacing for label
break specified. These decisions are made in the forum
select box part of the toolkit procedure, which trans-
lates this object into HTML.
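
To make the idea of an abstract construct concrete, here is a minimal Python sketch of a select box object that keeps all HTML details in one rendering method. It is not the toolkit's Tcl implementation; the class name SelectBox, the method to_html, and the exact markup produced (including the use of a line break for the label spacing) are assumptions chosen for illustration.

```python
from html import escape


class SelectBox:
    """Abstract select box; only to_html knows about HTML."""

    def __init__(self, name, mapto, label, labelbreak=False):
        self.name = name              # object name, e.g. "s.language"
        self.mapto = mapto            # database attribute the field maps to
        self.label = label
        self.labelbreak = labelbreak
        self.options = []             # (text, value, selected) tuples

    def add_option(self, text, value, selected=False):
        self.options.append((text, value, selected))

    def to_html(self):
        # All presentation decisions live here, so a change in HTML
        # or page design touches only this method.
        lines = [escape(self.label)]
        if self.labelbreak:
            lines.append("<br>")
        lines.append('<select name="%s">' % escape(self.mapto))
        for text, value, selected in self.options:
            flag = " selected" if selected else ""
            lines.append('<option value="%s"%s>%s</option>'
                         % (escape(str(value)), flag, escape(text)))
        lines.append("</select>")
        return "\n".join(lines)


box = SelectBox("s.language", mapto="language",
                label="Select a language:", labelbreak=True)
box.add_option("English", 1, selected=True)
box.add_option("French", 2)
html_text = box.to_html()
```

Application code in this sketch only names the label and options; fonts, spacing, and tag syntax stay inside the rendering method.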

Most of the early Web browsers were single-window 
based. This limitation was especially problematic for us 
because most of our applications provide some organi- 
zation to the information content. A much more nat- 
ural way of browsing for our environment would 
include at least two windows: one showing the context 
and the other showing the content of a specific item. 
For this reason, we introduced multiple navigational 
methods. For example, the discussion application 

■ Allows hierarchical navigation (previous, next, up) 

■ Allows navigation in chronological order (next
unseen, what's new) 

■ Provides a category view that lists topics according 
to their category 

■ Supports content-based search or an index-like 
function 

Newer versions of Web browsers support frames, 
which have multiple window-browsing capabilities 
(although the standards in this area are still a bit 
vague). We are updating our applications to take 
advantage of these new features. 

Usability studies guided our decisions as we were 
designing forms and dialog boxes. It is likely that 
many potential users of our product are familiar with 
Windows-style user interface objects. Because the 
early Web browsers (e.g., Mosaic) were UNIX-based, 
little attention was given to providing a human- 
computer interface that resembled the more widely 
used Windows interface. However, our usability 
studies indicated that many personal computer (PC) 
users had difficulty using Web browsers out-of-the-
box. For example, a user might expect a dialog box to
have certain standard buttons, such as OK, cancel, and 
clear. Ideally, the user would know what to do with 
these buttons without any training. To make our soft- 
ware easy to learn, we tried to follow the same user 
interface style that was already familiar to most users. 
Since we were limited by HTML and browser design, 
this was not a simple task. Thus we were often forced 
to produce rough facsimiles of the more well-known 
interface artifacts. 

In summary, we found usability studies to be 
extremely valuable when designing end-user applica- 
tions. For this reason, it is important to allocate 
enough time in the product design cycle to collect user 
feedback before beginning product development. 

The Pros and Cons of Using an Interpretive Language 

As mentioned earlier, we selected Tcl as the language
for building the AltaVista Forum toolkit. Tcl is a
highly portable, extensible, and freely available lan- 
guage that was originally designed to be embedded in 
a larger framework. 4 However, it is also an interpretive 



language, which supported our goal of rapid and 
iterative development of collaborative applications 
for the WWW. 

We extended standard Tcl to provide a set of com-
mands and objects that formed the AltaVista Forum
toolkit: database, HTML generation, access control, 
internationalization, user profile management, and 
platform-specific support. Many of these extensions 
supported an object-based environment (i.e., the 
environment supported standard Tcl objects and our
simple inheritance mechanism). The use of these 
extensions made it easier to develop applications than 
it would have been with Tcl (or any other language)
alone. As a result, these extensions form the basis for
future development tools. 

From the beginning, we knew that the choice of 
an interpretive language was going to involve trade- 
offs. In fact, performance, which was our most critical 
trade-off, continues to be a concern for the engineer- 
ing team. Although the performance of an interpreted 
language is lower than that of a compiled language, 
fast processors have made the use of an interpreter 
worthwhile because of the reduced expense of devel-
oping applications. The use of Tcl in the AltaVista
Forum software certainly takes advantage of this. 
Although the applications and part of the toolkit are 
written in Tcl, many critical parts are implemented in
a compiled language (such as C) to stay within per- 
formance requirements. The engineering team is con- 
tinually searching for ways to improve performance 
while accommodating requests for new features and 
tracking the rapidly evolving WWW environment. 

The second trade-off was the absence of a sophisti- 
cated debugging and profiling environment. Partly 
due to the limitations of Tcl and partly due to the
stateless nature of WWW transactions, some of 
the more sophisticated development tools that pro- 
grammers expect to see are not readily available. 
Despite these shortcomings, rapid development is still 
possible; however, we expect even larger gains as we 
correct these problems in the future. 

Interfacing to the Database 

Several factors (primarily portability and cost) influ- 
enced our decision to build a hybrid database rather 
than the more customary relational database. The 
database in the AltaVista Forum toolkit consists of a
B-tree indexed file (from the Berkeley ndbm package)
for storage of basic attributes about documents, which 
is backed by the file system for the nonstructured data. 
This design, combined with the search engine 
(described in the next section), is quite effective for 
the types of applications we initially developed with 
the AltaVista Forum toolkit. 

In effect, the database is organized as a collection of 
documents (or entries) that have unique identifiers 
(document IDs), hierarchical document numbers, and 
a set of attributes that is similar to a relational database
table. The toolkit provides each entry with a set of 
built-in attributes (such as title, creation and modi- 
fication dates, and author). The applications can then 
deliver additional attributes. 

The toolkit provides the means to retrieve, modify, 
and iterate through the collection of entries in a 
straightforward manner. Because the attributes are 
part of the application description and are not stored 
in a separate database, the toolkit can use its knowl- 
edge of the attributes to simplify certain common 
operations. For example, because transferring data 
from HTML forms to the database and back is a basic 
operation in collaborative applications, the toolkit can 
link fields on forms to database attributes, making it 
possible to store them with a single command. To sup- 
port a dynamic development environment, the toolkit 
also upgrades databases in real time as new attributes 
are added or deleted. This permits the application 
developer to concentrate on the task at hand rather 
than worry about database management tasks.
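
The two conveniences described above, storing mapped form fields in one call and upgrading existing entries when attributes change, can be sketched as follows. This is a Python illustration using an in-memory dictionary rather than the toolkit's indexed file; the class name Forum and the methods store_form and add_attribute are hypothetical.

```python
class Forum:
    """In-memory stand-in for a forum database with declared attributes."""

    def __init__(self, attributes):
        self.attributes = dict(attributes)   # attribute name -> default value
        self.entries = {}                    # document ID -> attribute values
        self.next_id = 1

    def create(self):
        """Create an entry populated with the default attribute values."""
        doc_id = self.next_id
        self.next_id += 1
        self.entries[doc_id] = dict(self.attributes)
        return doc_id

    def store_form(self, doc_id, form, mapping):
        """Copy submitted form fields to the attributes they are mapped to."""
        entry = self.entries[doc_id]
        for field, attribute in mapping.items():
            if field in form and attribute in self.attributes:
                entry[attribute] = form[field]

    def add_attribute(self, name, default):
        """Declare a new attribute and upgrade existing entries in place."""
        self.attributes[name] = default
        for entry in self.entries.values():
            entry.setdefault(name, default)
```

Because the mapping from form fields to attributes is part of the application description, unmapped fields in a submitted form are simply ignored in this sketch.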

Although the primary organization mechanism is a 
flat table indexed by document identifiers, the database 
integrates a hierarchical relationship between entries 
when necessary. Because hierarchies are common in
collaborative applications (e.g., folders/documents 
and topics/replies), it was important to reflect this in 
a natural way in the database. 
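
One way to picture hierarchical document numbers is as dotted strings such as 1.2.1 (a reply to reply 1.2 of topic 1). The paper does not specify the notation, so the dotted form and the helper names in this Python sketch are assumptions; it only shows how threaded order and parent/child relations could be derived from such numbers while the entries themselves remain in a flat table.

```python
def parse_number(number):
    """Turn '1.10' into (1, 10) so that numbers sort numerically, not as text."""
    return tuple(int(part) for part in number.split("."))


def parent(number):
    """Return the hierarchical number of the parent, or None for a top-level entry."""
    head, _, _ = number.rpartition(".")
    return head or None


def thread_order(numbers):
    """Sort numbers so each entry is followed by its descendants."""
    return sorted(numbers, key=parse_number)


def children(numbers, number):
    """List the direct children of an entry in threaded order."""
    return [n for n in thread_order(numbers) if parent(n) == number]
```

Sorting by the parsed tuple places 1.2 before 1.10, which a plain string sort would get wrong.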

In addition to attributes, the database offers proper- 
ties. Compared to attributes, which are stored for each 
entry in the database, properties are stored within 
each forum. Application designers can use these prop- 
erties in any way they desire: they are simple key-value 
relationships. The AltaVista Forum software uses 
properties to implement a variety of features, from 
access control policies to the background color of the 
screen display. 

User properties are an extension of standard forum 
properties. They act like forum properties except that 
they are tied to the user who is executing the transac-
tion. User properties keep database locking to a mini-
mum because, in collaborative applications, a user will
typically execute only one transaction at a time. 

Indexing and Search: The Way of the Future? 

One key design decision was to include an indexing 
and search engine as a basic component of the prod- 
uct. Although the database is often the central piece of 
a groupware product, an indexing and search engine 
often plays a similar role for a WWW site. This devel- 
opment is completely consistent with the philosophy 
of the WWW — information is linked as needed, not 
necessarily following any structure. Database use is 
more suitable for information objects that have some 
uniformity in their definitions. 

The basic function of the indexing engine is to map 
a set of words to a document containing those words. 



(The term document is used in a generic sense. It can 
be any logical entity associated with a set of words.)
The indexing information must be stored in such a 
way that subsequent searches based on individual 
words (and phrases) are efficient and speedy. The 
indexing engine in the AltaVista Forum toolkit is 
basically the same indexing engine available on the 
AltaVista Web site. 7 Designed and implemented at 
Digital's System Research Center, it is highly scalable 
and efficient. 
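
The basic word-to-document mapping can be illustrated with a small inverted index. This Python sketch is a generic textbook structure, not the AltaVista indexing engine; the class name InvertedIndex, the tokenization rule, and the all-words-must-match query semantics are assumptions, and phrase search is not handled.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Maps each word to the set of documents that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)   # word -> set of document IDs

    def add(self, doc_id, text):
        """Index every lowercase alphanumeric word of the text."""
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[word].add(doc_id)

    def search(self, query):
        """Return IDs of documents containing every word in the query."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result
```

A search touches only the posting sets of the query words, which is why lookups stay fast as the number of indexed documents grows.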

The built-in database functions as a repository for 
entries with a predefined set of attributes. It provides 
fast retrieval when the entries are identified using either 
an entry ID or a hierarchical ID, and it provides simple 
creating, updating, and sorting functions associated 
with retrieval. The indexing and search engine comple- 
ments the AltaVista Forum database: it provides a 
content-based search method and functions at higher 
speed. Since the search engine is extremely fast and 
scalable, we also use it to index some of the attribute 
values in the database. This allows us to use the search 
engine for certain compute-intensive searches that 
otherwise would be performed by the database. 

Based on our experience, we expect the capabilities 
of the indexing and search engine to continue to 
expand. As the popularity of the WWW technology 
continues to grow, the volume of published informa- 
tion will also increase. Only a small amount of this 
information can be effectively captured in databases. 
The indexing and search engine is an invaluable tool 
for mining useful information out of the vast amount
of data stored in these databases. 

The Dilemma of Access Control 

Designing access control is very challenging because 
users and administrators have different requirements. 
On the one hand, administrators want a high degree 
of flexibility in controlling access. Their issues include 
the following: 

■ What type of information is subject to access control? 

■ Should access control be defined for every possible 
access/action type? 

■ Should there be arbitrary flexibility in defining 
groups (including nesting)? 

On the other hand, users have stated that they do 
not like products in which access control operations 
are complex, especially in the case of a product that 
is supposed to help people collaborate. They argue that,
in a majority of scenarios, very little access control 
is needed. 

For this reason, we tried to strike a balance between 
administrators' needs and users' preferences. Although 
we recognize the importance of access control, we did 
not give it precedence over product usability. Since 
usability was our priority, and the time available to 
work on it was limited, we divided our efforts between 
making access control flexible and choosing default 
options that would promote collaboration. 

We defined access control for the whole database 
(forum), rather than for individual entries and attrib- 
utes of entries. However, some entry-level access con- 
trol is necessary. For example, it is preferable to let only 
the owner (or the creator) of an entry modify and 
delete that entry. As a result, we allowed the group def- 
inition to include entry-specific logical users, rather 
than provide a general mechanism for entry-level access 
control. Therefore, a group may contain a member 
who is the owner of the current entry. During access 
control checking, the current entry's owner is looked 
up and matched against the currently logged-in user. 
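
The owner check described above can be sketched as a group that may contain a logical member alongside ordinary users. The names here (`OWNER`, `Group`, `contains`) are hypothetical and only illustrate the idea; the toolkit's actual data structures are not given in this paper.

```python
from dataclasses import dataclass, field

# Hypothetical marker for the entry-specific logical user "owner of the current entry".
OWNER = "<entry-owner>"


@dataclass
class Group:
    name: str
    members: set = field(default_factory=set)

    def contains(self, user, entry_owner=None):
        """Return True if the user is a member, directly or as the entry's owner."""
        if user in self.members:
            return True
        # The logical member is resolved at check time: the current entry's
        # owner is looked up and matched against the logged-in user.
        return (
            OWNER in self.members
            and entry_owner is not None
            and user == entry_owner
        )


modifiers = Group("modifiers", {"moderator1", OWNER})
print(modifiers.contains("alice", entry_owner="alice"))
print(modifiers.contains("bob", entry_owner="alice"))
```

Resolving the logical member at check time keeps the group definition the same for every entry, so no per-entry access list has to be stored.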

Instead of letting the administrator define access 
control for each possible incoming access/action, our 
framework allows the application definition to group 
accesses together into logical access rights. For exam- 
ple, for the discussion application, we defined the fol- 
lowing access rights: 

■ Read — Includes all read URLs (different views, 
whether for a single entry or a list of entries) 

■ Contribute — Includes adding a topic or reply 

■ Modify — Includes any form of modification or 
deletion 

■ Moderate — Includes such functions as creating key- 
words, polling options, controlling number of levels 
of replies, and setting certain entries as hidden 

■ Administrate — Includes changing access control or other 
kinds of resource consumption policies 

By defining these access rights, the administrator needs 
only to establish who can perform these five operations, 
rather than define numerous other kinds of opera- 
tions. It is still possible to change and add to this 
group of access rights by making simple modifications 
to the application definition. 

Our basic strategy for making access control easy to 
manage is to set up default policies of access control 
that apply to as many situations as possible, within rea- 
son. The default policy is added to the application def- 
inition. If the administrator is satisfied with the default 
policies, then the access control can be used as sup- 
plied. For the discussion application, the default policy 
is the following: 

■ Read — All users, including anonymous 

■ Contribute — All users, excluding anonymous 

■ Modify — Owner (creator) of entry and moderators 

■ Moderate — Owner of the forum 

■ Administrate — Owner of the forum 
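
The grouping of accesses into rights, together with the default policy listed above, can be sketched as two small tables and one check. Everything named here is an assumption for illustration: the action names, the role labels, and the `forum` and `entry` dictionaries are hypothetical, and the real application definition format is not shown in this paper.

```python
# Hypothetical incoming actions, each grouped under one logical access right.
ACTION_TO_RIGHT = {
    "view_entry": "read",
    "list_entries": "read",
    "add_topic": "contribute",
    "add_reply": "contribute",
    "edit_entry": "modify",
    "delete_entry": "modify",
    "create_keyword": "moderate",
    "hide_entry": "moderate",
    "change_access": "administrate",
}

# Default policy for the discussion application, as listed in the text.
DEFAULT_POLICY = {
    "read": {"all", "anonymous"},
    "contribute": {"all"},
    "modify": {"entry_owner", "moderators"},
    "moderate": {"forum_owner"},
    "administrate": {"forum_owner"},
}


def roles_of(user, forum, entry=None):
    """Work out which roles the user holds; None stands for an anonymous user."""
    if user is None:
        return {"anonymous"}
    roles = {"all"}
    if user == forum["owner"]:
        roles.add("forum_owner")
    if user in forum["moderators"]:
        roles.add("moderators")
    if entry is not None and user == entry["owner"]:
        roles.add("entry_owner")
    return roles


def is_allowed(user, action, forum, entry=None, policy=DEFAULT_POLICY):
    """Map the action to its access right, then check the user's roles against it."""
    right = ACTION_TO_RIGHT[action]
    return bool(roles_of(user, forum, entry) & policy[right])


forum = {"owner": "carol", "moderators": {"carol"}}
entry = {"owner": "alice"}
print(is_allowed(None, "view_entry", forum))
print(is_allowed("bob", "edit_entry", forum, entry))
```

An administrator who wants a different policy would change only the five entries in the policy table, not the handling of each individual action.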

To simplify implementation, we chose not to allow 
nesting of groups. Our design allows for adding it in 
the future as long as it makes management of access 
control policies easier. 

Future Directions 

To date, we have received encouraging feedback from 
users. Of the ways that we can continue to improve 
the AltaVista Forum product, we feel the following 
deserve the highest priority. 

First, we need to provide better ways to help users 
deal with information overflow. Although we have 
built ways to filter and search information into our 
application, further simplification is necessary. We 
are working on smart agents that bring the relevant 
information to the user's fingertips. 

Second, a number of the functions that we provide 
can be more easily performed on the client machine. 
The Java language is the best candidate for providing 
these functions since it enables us to handle a wide 
variety of client platforms. Initially, we are looking into 
using Java to improve certain user interface problems, 
such as opening additional windows on the client 
machine to notify users of new information. 

Third, synchronous collaboration using video, 
audio, and whiteboard will soon become feasible and 
cost effective. It is important for us to help bring users 
together through both synchronous and asynchro- 
nous methods of collaboration. For example, users 
should be able to use the calendar application to 
schedule a meeting over the Internet, and windows 
should be available to the user automatically. 

Fourth, as the AltaVista Forum software matures, 
we hope to add to its performance and increase 
its scalability. As its environment evolves, we are look- 
ing into ways to bypass the CGI interface and use a 
compiled language for more of the toolkit implemen- 
tation. We also hope to add support for large commer- 
cial databases. 

Finally, we will continue to add innovative applica- 
tions to our product. We recently built a prototype of 
a customer-support application that keeps track of 
problem reporting. We are looking into other applica- 
tions such as project management, group review, and 
survey and decision-support systems. 

Recent Digital 
U.S. Patents 



The following patents were recently issued to Digital 
Equipment Corporation. Titles and names supplied 
to us by the U.S. Patent and Trademark Office are 
reproduced as they appear on the original published 
patent. 



D353,800    M. S. Lewis, L. A. Treseder, R. M. Tusler, and G. Suzda
            Modular Enclosure for Electronic Equipment

5,371,807   N. Kannan and M. S. Register
            Method and Apparatus for Text Classification

5,371,822   F. Horwitz and E. Thomson
            Method of Packaging and Assembling Opto-electronic
            Integrated Circuits

5,371,868   G. P. Konig, H. S. Yang, and W. Hawe
            Method and Apparatus for Deriving Addresses for Stored
            Address Information for Use in Identifying Devices during
            Communication

5,371,870   P. M. Goodwin, D. Smelser, and D. A. Tatosian
            Stream Buffer Memory Having a Multiple-entry Address
            History Buffer for Detecting Sequential Reads to Initiate
            Prefetching

5,371,874   M. Gagliardo, J. Lynch, K. Chinnaswamy, and J. Tessari
            Write-read/Write-pass Memory Subsystem Cycle

5,371,889   J. Klein
            Journalling Optimization System and Method for
            Distributed Computations

5,372,262   J. M. Benson and J. E. Fritscher
            Frame Assembly for Rack-mountable Equipment

5,373,421   C. Detsikas and T. Spellman
            Fiber Optic Transceiver Mounting Bracket

5,375,068   H. S. Palmer and L. G. Palmer
            Video Teleconferencing for Networked Workstations

5,375,199   J. R. Harrow and F. P. Messinger
            System Monitoring Method and Device, Including a
            Graphical User Interface to View and Manipulate System
            Information

5,377,190   H. Yang, K. K. Ramakrishnan, B. Spinney, and K. R. Jain
            Frame Removal Mechanism Using Frame Count for Token
            Ring Networks

5,377,327   K. R. Jain, K. K. Ramakrishnan, and D.-M. Chiu
            Congestion Avoidance Scheme for Computer Networks

5,377,354   N. Scannell, A. Redmond, P. Bares, A. Clark,
            S. Dawson, and S. Himbaut
            Method and System for Sorting and Prioritizing Electronic
            Mail Messages

5,378,945   H. Partovi, S. Butler, and L. Tran
            Voltage Level Converting Buffer Circuit

5,379,419   J. S. Heffernan, P. L. Savage, S. J. Pittman,
            and R. V. Sunkara
            Methods and Apparatus for Accessing Non-relational Data
            Files Using Relational Queries

5,381,052   R. Kolte
            Peak Detector Circuit and Application in a Fiber Optic
            Receiver

5,381,146   R. Kolte
            Voltage-tracking Circuit and Application in a
            Track-and-hold Amplifier

5,382,831   B. Lee, E. Atakov, and J. Clement
            Integrated Circuit Metal Film Interconnect Having
            Enhanced Resistance to Electromigration

5,383,096   M. C. Benson and L. M. Mazzone
            I/O Expansion Box

5,384,779   M. Patrick and J. A. Daly
            State Machines for Configuration of a Communications
            Network

5,385,289   C. Bioch, P. McKinley, and R. Ranganathan
            Embedded Features for Registration Measurement in
            Electronics Manufacturing

5,385,630   A. Philipossian, H. Soleimani, and B. Doyle
            Process for Increasing Sacrificial Oxide Etch Rate to
            Reduce Field Oxide Loss

5,386,514   V. Boaen, R. Lary, B. Rubinson, D. Thiel,
            C. Van Ingen, W. Watson, R. Willard, and
            Queue Apparatus and Mechanics for a Communications
            Interface Architecture

5,386,523   N. A. Crook, M. J. Seaman, and D. L. A. Brash
            Addressing Scheme for Accessing a Portion of a
            Large Memory Space

5,386,524   V. Boaen, R. Lary, B. Rubinson, D. Thiel,
            C. Van Ingen, W. Watson, and R. Willard
            System for Accessing Information in a Data Processing
            System

5,387,495   J. C. K. Lee, M. Castro, F. Tung, C. Lee,
            Sequential Multilayer Process for Using Fluorinated
            Hydrocarbons as a Dielectric

5,387,530   A. Philipossian and B. Doyle
            Threshold Optimization for SOI Transistors through Use
            of Negative Charge in the Gate Oxide

            Backplane Wiring for Hub in Packet Data
            Communications System

5,388,222   L. A. P. Chisvin, J. F. Rantala, J. K. Grooms,
            and D. W. Hartwell
            Memory Subsystem Input Queue

5,388,224   B. Maskas
            Processor Identification Mechanism for a Multiprocessor
            System

5,388,247   P. Goodwin, K. Thaller, and B. Maskas
            History Buffer Control to Reduce Unnecessary Allocations
            in a Memory Stream Buffer

5,388,263   R. K. Peterson, J. R. Ellis, and C. G. Nylander
            Procedure State Descriptor System for Digital Data
            Processors

5,389,757   E. Souliere
            Elastomeric Key Switch Actuator

5,390,173   B. Spinney, R. Simcoe, G. Varghese, and R. Thomas
            Packet Format in Hub for Packet Data Communications
            System

5,390,286   Iv K. Ramakrishnan
            Reticular Discrimination Network for Specifying Real-time
            Conditions

5,390,299   S. L. Rege, K. K. Ramakrishnan, and D. A. Gagne
            System for Using Three Different Methods to Report
            Buffer Memory Occupancy Information Regarding
            Fullness-related and/or Packet Discard-related
            Information

5,390,302   J. Johnson, M. Howell, and C. Whitaker
            Transaction Control

5,390,318   K. K. Ramakrishnan and P. Biswas
            Cache Arrangement for File System in Digital Data
            Processing System

5,390,327   C. Lubbers and D. Thiel
            Method for On-line Reorganization of the Data on a
            RAID-4 or RAID-5 Array in the Absence of One Disk
            and the On-line Restoration of a Replacement Disk

5,392,219   S. Birch, G. Gavrel, and Z. Memon
            Determination of Interconnect Stress Test Current

5,394,143   J. Murray and G. Antoshenkov
            Run-length Compression of Index Keys

5,394,347   R. Kita, S. Tremblay, and T. Lynch
            Method and Apparatus for Generating Tests for Structures
            Expressed as Extended Finite State Machines

5,394,401   M. Patrick and J. A. Daly
            Arrangement for a Token Ring Communications Network

5,394,529   J. Brown, J. Meyer, and S. Persels
            Branch Prediction Unit for High-performance Processor

Call for Papers 
Network Products 
and Technologies 

The Digital Technical Journal seeks technical papers in all areas of networking 
technology for an issue to be published in the fall of 1997. Digital's engineers and 
industry partners interested in participating in the special issue should send topics 
and brief abstracts (100 words) by February 10, 1997, to 

Jane Blake, Managing Editor 
Digital Technical Journal 
Digital Equipment Corporation 
50 Nagog Park, AK02-3/B3 
Acton, MA 01720-9843 
Email: jane.blake@ljo.dec.com 
508-486-2544 

Notice of the topics accepted will be sent to all authors by February 28, 1997. 
The manuscript-submission date for accepted topics is May 30, 1997. 

For information on topics published in the journal, the audience, writing 
guidelines, and the peer-review process, see 
http://www.digital.com/info/dtj/dtj-guide.htm 
or contact the Managing Editor at the address above. 